{
    "version": "https://jsonfeed.org/version/1",
    "title": "hpc.social - Aggregated Personal Blog",
    "home_page_url": "https://hpc.social/personal-blog/",
    "feed_url": "https://hpc.social/personal-blog/feed.json",
    "description": "Shared personal experiences and stories",
    "icon": "https://hpc.social/personal-blog/assets/images/apple-touch-icon.png",
    "favicon": "https://hpc.social/personal-blog/assets/images/favicon.png",
    "expired": false,
    
    "author":  {
        "name": "hpc.social",
        "url": null,
        "avatar": null
    },
    
"items": [
    
        {
            "id": "https://hpc.social/personal-blog/2026/opensearch-transform-job-the-case-of-the-silent-failure-and-the-ghost-key/",
            "title": "OpenSearch Transform Job- The Case of the Silent Failure and the Ghost Key",
            "summary": null,
            "content_text": "Debugging OpenSearch Transform jobs can feel like searching for a needle in a haystack, especially when the error messages are generic. This post chronicles a recent debugging journey, highlighting common pitfalls and the ultimate solution to a persistently failing transform job.\n\nThe Problem: Summarizing XRootD Stash Data\n\nOur goal was straightforward: aggregate XRootD stash access logs (xrd-stash*) into a daily summary index (osdf-summary-{year}). This involved grouping by several file path components, server details, and user domains, then calculating sums, averages, and counts of metrics like filesize, read, and write.\n\nHere is a snippet of the initial (problematic) transform configuration:\n\n{\n  \"transform\": {\n    \"transform_id\": \"osdf-summary-2022\",\n    \"description\": \"OSDF summary transform for year 2022\",\n    \"source_index\": \"xrd-stash*\",\n    \"target_index\": \"osdf-summary-2022\",\n    \"page_size\": 1000,\n    \"groups\": [\n      {\n        \"date_histogram\": {\n          \"source_field\": \"@timestamp\",\n          \"target_field\": \"@timestamp\",\n          \"calendar_interval\": \"1d\"\n        }\n      },\n      {\n        \"terms\": {\n          \"source_field\": \"dirname1.keyword\",\n          \"target_field\": \"dirname1\"\n        }\n      }\n    ],\n    \"aggregations\": {\n      \"filesize_sum\": { \"sum\": { \"field\": \"filesize\" } },\n      \"filesize_avg\": { \"avg\": { \"field\": \"filesize\" } }\n    }\n  }\n}\n\nThe Symptoms: Generic Errors and Timeouts\n\nThe transform job kept failing with a rather unhelpful message in its metadata:\n\n{\n  \"status\": \"failed\",\n  \"failure_reason\": \"Failed to index the documents\",\n  \"stats\": {\n    \"pages_processed\": 96,\n    \"documents_processed\": 89737708,\n    \"documents_indexed\": 96000,\n    \"index_time_in_millis\": 44733,\n    \"search_time_in_millis\": 1715612\n  }\n}\n\nNotice the high search_time_in_millis compared to index_time_in_millis. 
This was a critical clue that the aggregation phase was struggling.\n\nFurther attempts to debug with _explain or custom composite aggregation queries often resulted in:\n  502 Bad Gateway / timed_out: The query was too resource-intensive for the cluster to handle.\n  illegal_argument_exception: Missing value for [after.date_histogram]: A mismatch in how the after_key was structured versus the sources in the composite aggregation.\n  illegal_argument_exception: Invalid value for [after.site], expected comparable, got [null]: The transform was getting stuck on null values within its grouping keys.\n\nThe Debugging Journey and Discoveries\n\nThrough a series of focused queries and iterative refinements, we uncovered several interconnected issues.\n\n1. Composite Aggregation Challenges and the “Ghost Key”\n\nOur composite aggregation debugging queries kept failing. This was traced to:\n  Syntax mismatches: names in the after key must exactly match the names defined in sources (for example, @timestamp must match @timestamp).\n  null values in after_key: terms aggregations can fail when after_key includes null, unless handled explicitly.\n\nThen came the key finding: a direct search for documents matching the transform’s after_key yielded zero results. The transform was trying to resume from a state that no longer existed in source data.\n\n2. The Real Culprit: Unparsed “Garbage” Data\n\nAn inverse query (documents missing expected fields) revealed records like:\n\n{\n  \"_index\": \"xrd-stash-ilm-000037.reindexed\",\n  \"_id\": \"cAqUoH4BOTrVvgqCSyKq\",\n  \"_source\": {\n    \"message\": \"GET / HTTP/1.1\\n\",\n    \"@timestamp\": \"2022-01-28T12:06:20.222Z\",\n    \"host\": \"ec2-3-110-169-111.ap-south-1.compute.amazonaws.amazonaws.com\",\n    \"tags\": [\"_grokparsefailure\"]\n  }\n}\n\nThese were logs that failed parsing and were actually web traffic hitting the server, not XRootD stash operations. 
They lacked key transform fields like logical_dirname, filesize, and server.\n\nWhen the transform encountered enough of these records, grouping keys became null. Combined with malformed or very long field values, the composite aggregation became unstable and hit timeouts.\n\n3. Precision for PetaByte-Scale Data\n\nNot a crash cause, but still important: float is not precise enough for large sums at petabyte scale.\n\nSolution: use double for sums/averages and long for counts.\n\nThe Ultimate Solution: Resilience and Precision\n\nThe final, robust fix used multiple changes together.\n\n1. Stop and Delete Stale State\n\nStop the transform and delete the target index to clear bad transform/index state.\n\nPOST _plugins/_transform/osdf-summary-2022/_stop\nDELETE osdf-summary-2022\n\n2. Recreate Index with Explicit High-Precision Mappings\n\nPUT osdf-summary-2022\n{\n  \"mappings\": {\n    \"properties\": {\n      \"@timestamp\": { \"type\": \"date\" },\n      \"dirname1\": { \"type\": \"keyword\" },\n      \"logical_dirname\": { \"type\": \"keyword\" },\n      \"filesize_sum\": { \"type\": \"double\" },\n      \"filesize_avg\": { \"type\": \"double\" },\n      \"filesize_count\": { \"type\": \"long\" },\n      \"doc_count\": { \"type\": \"long\" }\n    }\n  }\n}\n\n3. Add Intelligent Filtering in data_selection_query\n\n  Exclude _grokparsefailure events.\n  Require existence of critical grouping fields.\n  
Add script guards against empty or oversized keyword values.\n\n\"data_selection_query\": {\n  \"bool\": {\n    \"must\": [\n      {\n        \"range\": {\n          \"@timestamp\": {\n            \"gte\": f\"{year}-01-01T00:00:00Z\",\n            \"lt\": f\"{year + 1}-01-01T00:00:00Z\"\n          }\n        }\n      }\n    ],\n    \"must_not\": [\n      { \"term\": { \"tags\": \"_grokparsefailure\" } }\n    ],\n    \"filter\": [\n      { \"exists\": { \"field\": \"logical_dirname.keyword\" } },\n      {\n        \"script\": {\n          \"script\": \"doc['logical_dirname.keyword'].size() > 0 && doc['logical_dirname.keyword'].value.length() < 1000\"\n        }\n      }\n    ]\n  }\n}\n\n4. Reduce page_size\n\nLowering page_size from 1000 to 50 significantly reduced memory pressure per composite aggregation page and helped avoid 502 Bad Gateway failures.\n\n5. Restart the Transform\n\nAfter recreating the index and updating the transform definition, restart the job.\n\nConclusion\n\nBy combining explicit mappings, stronger filtering, smaller pagination, and a reset of stale transform state, the transform ran reliably and produced accurate summaries without repeated failure loops.\n\nThis debugging story reinforced a key lesson: robust pipelines are not just about handling valid data, but actively excluding invalid or malformed records before they poison downstream aggregation logic.",
            "content_html": "<p>Debugging OpenSearch Transform jobs can feel like searching for a needle in a haystack, especially when the error messages are generic. This post chronicles a recent debugging journey, highlighting common pitfalls and the ultimate solution to a persistently failing transform job.</p><h2 id=\"the-problem-summarizing-xrootd-stash-data\">The Problem: Summarizing XRootD Stash Data</h2><p>Our goal was straightforward: aggregate XRootD stash access logs (<code class=\"language-plaintext highlighter-rouge\">xrd-stash*</code>) into a daily summary index (<code class=\"language-plaintext highlighter-rouge\">osdf-summary-{year}</code>). This involved grouping by several file path components, server details, and user domains, then calculating sums, averages, and counts of metrics like <code class=\"language-plaintext highlighter-rouge\">filesize</code>, <code class=\"language-plaintext highlighter-rouge\">read</code>, and <code class=\"language-plaintext highlighter-rouge\">write</code>.</p><p>Here is a snippet of the initial (problematic) transform configuration:</p><div class=\"language-json highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code><span class=\"p\">{</span><span class=\"w\">  </span><span class=\"nl\">\"transform\"</span><span class=\"p\">:</span><span class=\"w\"> </span><span class=\"p\">{</span><span class=\"w\">    </span><span class=\"nl\">\"transform_id\"</span><span class=\"p\">:</span><span class=\"w\"> </span><span class=\"s2\">\"osdf-summary-2022\"</span><span class=\"p\">,</span><span class=\"w\">    </span><span class=\"nl\">\"description\"</span><span class=\"p\">:</span><span class=\"w\"> </span><span class=\"s2\">\"OSDF summary transform for year 2022\"</span><span class=\"p\">,</span><span class=\"w\">    </span><span class=\"nl\">\"source_index\"</span><span class=\"p\">:</span><span class=\"w\"> </span><span class=\"s2\">\"xrd-stash*\"</span><span class=\"p\">,</span><span class=\"w\">    
</span><span class=\"nl\">\"target_index\"</span><span class=\"p\">:</span><span class=\"w\"> </span><span class=\"s2\">\"osdf-summary-2022\"</span><span class=\"p\">,</span><span class=\"w\">    </span><span class=\"nl\">\"page_size\"</span><span class=\"p\">:</span><span class=\"w\"> </span><span class=\"mi\">1000</span><span class=\"p\">,</span><span class=\"w\">    </span><span class=\"nl\">\"groups\"</span><span class=\"p\">:</span><span class=\"w\"> </span><span class=\"p\">[</span><span class=\"w\">      </span><span class=\"p\">{</span><span class=\"w\">        </span><span class=\"nl\">\"date_histogram\"</span><span class=\"p\">:</span><span class=\"w\"> </span><span class=\"p\">{</span><span class=\"w\">          </span><span class=\"nl\">\"source_field\"</span><span class=\"p\">:</span><span class=\"w\"> </span><span class=\"s2\">\"@timestamp\"</span><span class=\"p\">,</span><span class=\"w\">          </span><span class=\"nl\">\"target_field\"</span><span class=\"p\">:</span><span class=\"w\"> </span><span class=\"s2\">\"@timestamp\"</span><span class=\"p\">,</span><span class=\"w\">          </span><span class=\"nl\">\"calendar_interval\"</span><span class=\"p\">:</span><span class=\"w\"> </span><span class=\"s2\">\"1d\"</span><span class=\"w\">        </span><span class=\"p\">}</span><span class=\"w\">      </span><span class=\"p\">},</span><span class=\"w\">      </span><span class=\"p\">{</span><span class=\"w\">        </span><span class=\"nl\">\"terms\"</span><span class=\"p\">:</span><span class=\"w\"> </span><span class=\"p\">{</span><span class=\"w\">          </span><span class=\"nl\">\"source_field\"</span><span class=\"p\">:</span><span class=\"w\"> </span><span class=\"s2\">\"dirname1.keyword\"</span><span class=\"p\">,</span><span class=\"w\">          </span><span class=\"nl\">\"target_field\"</span><span class=\"p\">:</span><span class=\"w\"> </span><span class=\"s2\">\"dirname1\"</span><span class=\"w\">        </span><span 
class=\"p\">}</span><span class=\"w\">      </span><span class=\"p\">}</span><span class=\"w\">    </span><span class=\"p\">],</span><span class=\"w\">    </span><span class=\"nl\">\"aggregations\"</span><span class=\"p\">:</span><span class=\"w\"> </span><span class=\"p\">{</span><span class=\"w\">      </span><span class=\"nl\">\"filesize_sum\"</span><span class=\"p\">:</span><span class=\"w\"> </span><span class=\"p\">{</span><span class=\"w\"> </span><span class=\"nl\">\"sum\"</span><span class=\"p\">:</span><span class=\"w\"> </span><span class=\"p\">{</span><span class=\"w\"> </span><span class=\"nl\">\"field\"</span><span class=\"p\">:</span><span class=\"w\"> </span><span class=\"s2\">\"filesize\"</span><span class=\"w\"> </span><span class=\"p\">}</span><span class=\"w\"> </span><span class=\"p\">},</span><span class=\"w\">      </span><span class=\"nl\">\"filesize_avg\"</span><span class=\"p\">:</span><span class=\"w\"> </span><span class=\"p\">{</span><span class=\"w\"> </span><span class=\"nl\">\"avg\"</span><span class=\"p\">:</span><span class=\"w\"> </span><span class=\"p\">{</span><span class=\"w\"> </span><span class=\"nl\">\"field\"</span><span class=\"p\">:</span><span class=\"w\"> </span><span class=\"s2\">\"filesize\"</span><span class=\"w\"> </span><span class=\"p\">}</span><span class=\"w\"> </span><span class=\"p\">}</span><span class=\"w\">    </span><span class=\"p\">}</span><span class=\"w\">  </span><span class=\"p\">}</span><span class=\"w\"></span><span class=\"p\">}</span><span class=\"w\"></span></code></pre></div></div><h2 id=\"the-symptoms-generic-errors-and-timeouts\">The Symptoms: Generic Errors and Timeouts</h2><p>The transform job kept failing with a rather unhelpful message in its metadata:</p><div class=\"language-json highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code><span class=\"p\">{</span><span class=\"w\">  </span><span class=\"nl\">\"status\"</span><span class=\"p\">:</span><span class=\"w\"> 
</span><span class=\"s2\">\"failed\"</span><span class=\"p\">,</span><span class=\"w\">  </span><span class=\"nl\">\"failure_reason\"</span><span class=\"p\">:</span><span class=\"w\"> </span><span class=\"s2\">\"Failed to index the documents\"</span><span class=\"p\">,</span><span class=\"w\">  </span><span class=\"nl\">\"stats\"</span><span class=\"p\">:</span><span class=\"w\"> </span><span class=\"p\">{</span><span class=\"w\">    </span><span class=\"nl\">\"pages_processed\"</span><span class=\"p\">:</span><span class=\"w\"> </span><span class=\"mi\">96</span><span class=\"p\">,</span><span class=\"w\">    </span><span class=\"nl\">\"documents_processed\"</span><span class=\"p\">:</span><span class=\"w\"> </span><span class=\"mi\">89737708</span><span class=\"p\">,</span><span class=\"w\">    </span><span class=\"nl\">\"documents_indexed\"</span><span class=\"p\">:</span><span class=\"w\"> </span><span class=\"mi\">96000</span><span class=\"p\">,</span><span class=\"w\">    </span><span class=\"nl\">\"index_time_in_millis\"</span><span class=\"p\">:</span><span class=\"w\"> </span><span class=\"mi\">44733</span><span class=\"p\">,</span><span class=\"w\">    </span><span class=\"nl\">\"search_time_in_millis\"</span><span class=\"p\">:</span><span class=\"w\"> </span><span class=\"mi\">1715612</span><span class=\"w\">  </span><span class=\"p\">}</span><span class=\"w\"></span><span class=\"p\">}</span><span class=\"w\"></span></code></pre></div></div><p>Notice the high <code class=\"language-plaintext highlighter-rouge\">search_time_in_millis</code> compared to <code class=\"language-plaintext highlighter-rouge\">index_time_in_millis</code>. 
This was a critical clue that the aggregation phase was struggling.</p><p>Further attempts to debug with <code class=\"language-plaintext highlighter-rouge\">_explain</code> or custom composite aggregation queries often resulted in:</p><ul>  <li><code class=\"language-plaintext highlighter-rouge\">502 Bad Gateway / timed_out</code>: The query was too resource-intensive for the cluster to handle.</li>  <li><code class=\"language-plaintext highlighter-rouge\">illegal_argument_exception: Missing value for [after.date_histogram]</code>: A mismatch in how the <code class=\"language-plaintext highlighter-rouge\">after_key</code> was structured versus the <code class=\"language-plaintext highlighter-rouge\">sources</code> in the composite aggregation.</li>  <li><code class=\"language-plaintext highlighter-rouge\">illegal_argument_exception: Invalid value for [after.site], expected comparable, got [null]</code>: The transform was getting stuck on <code class=\"language-plaintext highlighter-rouge\">null</code> values within its grouping keys.</li></ul><h2 id=\"the-debugging-journey-and-discoveries\">The Debugging Journey and Discoveries</h2><p>Through a series of focused queries and iterative refinements, we uncovered several interconnected issues.</p><h3 id=\"1-composite-aggregation-challenges-and-the-ghost-key\">1. Composite Aggregation Challenges and the “Ghost Key”</h3><p>Our composite aggregation debugging queries kept failing. 
This was traced to:</p><ul>  <li>Syntax mismatches: names in the <code class=\"language-plaintext highlighter-rouge\">after</code> key must exactly match the names defined in <code class=\"language-plaintext highlighter-rouge\">sources</code> (for example, <code class=\"language-plaintext highlighter-rouge\">@timestamp</code> must match <code class=\"language-plaintext highlighter-rouge\">@timestamp</code>).</li>  <li><code class=\"language-plaintext highlighter-rouge\">null</code> values in <code class=\"language-plaintext highlighter-rouge\">after_key</code>: terms aggregations can fail when <code class=\"language-plaintext highlighter-rouge\">after_key</code> includes <code class=\"language-plaintext highlighter-rouge\">null</code>, unless handled explicitly.</li></ul><p>Then came the key finding: a direct search for documents matching the transform’s <code class=\"language-plaintext highlighter-rouge\">after_key</code> yielded zero results. The transform was trying to resume from a state that no longer existed in source data.</p><h3 id=\"2-the-real-culprit-unparsed-garbage-data\">2. 
The Real Culprit: Unparsed “Garbage” Data</h3><p>An inverse query (documents missing expected fields) revealed records like:</p><div class=\"language-json highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code><span class=\"p\">{</span><span class=\"w\">  </span><span class=\"nl\">\"_index\"</span><span class=\"p\">:</span><span class=\"w\"> </span><span class=\"s2\">\"xrd-stash-ilm-000037.reindexed\"</span><span class=\"p\">,</span><span class=\"w\">  </span><span class=\"nl\">\"_id\"</span><span class=\"p\">:</span><span class=\"w\"> </span><span class=\"s2\">\"cAqUoH4BOTrVvgqCSyKq\"</span><span class=\"p\">,</span><span class=\"w\">  </span><span class=\"nl\">\"_source\"</span><span class=\"p\">:</span><span class=\"w\"> </span><span class=\"p\">{</span><span class=\"w\">    </span><span class=\"nl\">\"message\"</span><span class=\"p\">:</span><span class=\"w\"> </span><span class=\"s2\">\"GET / HTTP/1.1</span><span class=\"se\">\\n</span><span class=\"s2\">\"</span><span class=\"p\">,</span><span class=\"w\">    </span><span class=\"nl\">\"@timestamp\"</span><span class=\"p\">:</span><span class=\"w\"> </span><span class=\"s2\">\"2022-01-28T12:06:20.222Z\"</span><span class=\"p\">,</span><span class=\"w\">    </span><span class=\"nl\">\"host\"</span><span class=\"p\">:</span><span class=\"w\"> </span><span class=\"s2\">\"ec2-3-110-169-111.ap-south-1.compute.amazonaws.amazonaws.com\"</span><span class=\"p\">,</span><span class=\"w\">    </span><span class=\"nl\">\"tags\"</span><span class=\"p\">:</span><span class=\"w\"> </span><span class=\"p\">[</span><span class=\"s2\">\"_grokparsefailure\"</span><span class=\"p\">]</span><span class=\"w\">  </span><span class=\"p\">}</span><span class=\"w\"></span><span class=\"p\">}</span><span class=\"w\"></span></code></pre></div></div><p>These were logs that failed parsing and were actually web traffic hitting the server, not XRootD stash operations. 
They lacked key transform fields like <code class=\"language-plaintext highlighter-rouge\">logical_dirname</code>, <code class=\"language-plaintext highlighter-rouge\">filesize</code>, and <code class=\"language-plaintext highlighter-rouge\">server</code>.</p><p>When the transform encountered enough of these records, grouping keys became <code class=\"language-plaintext highlighter-rouge\">null</code>. Combined with malformed or very long field values, the composite aggregation became unstable and hit timeouts.</p><h3 id=\"3-precision-for-petabyte-scale-data\">3. Precision for PetaByte-Scale Data</h3><p>Not a crash cause, but still important: <code class=\"language-plaintext highlighter-rouge\">float</code> is not precise enough for large sums at petabyte scale.</p><p>Solution: use <code class=\"language-plaintext highlighter-rouge\">double</code> for sums/averages and <code class=\"language-plaintext highlighter-rouge\">long</code> for counts.</p><h2 id=\"the-ultimate-solution-resilience-and-precision\">The Ultimate Solution: Resilience and Precision</h2><p>The final, robust fix used multiple changes together.</p><h3 id=\"1-stop-and-delete-stale-state\">1. Stop and Delete Stale State</h3><p>Stop the transform and delete the target index to clear bad transform/index state.</p><div class=\"language-json highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code><span class=\"err\">POST</span><span class=\"w\"> </span><span class=\"err\">_plugins/_transform/osdf-summary</span><span class=\"mi\">-2022</span><span class=\"err\">/_stop</span><span class=\"w\"></span><span class=\"err\">DELETE</span><span class=\"w\"> </span><span class=\"err\">osdf-summary</span><span class=\"mi\">-2022</span><span class=\"w\"></span></code></pre></div></div><h3 id=\"2-recreate-index-with-explicit-high-precision-mappings\">2. 
Recreate Index with Explicit High-Precision Mappings</h3><div class=\"language-json highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code><span class=\"err\">PUT</span><span class=\"w\"> </span><span class=\"err\">osdf-summary</span><span class=\"mi\">-2022</span><span class=\"w\"></span><span class=\"p\">{</span><span class=\"w\">  </span><span class=\"nl\">\"mappings\"</span><span class=\"p\">:</span><span class=\"w\"> </span><span class=\"p\">{</span><span class=\"w\">    </span><span class=\"nl\">\"properties\"</span><span class=\"p\">:</span><span class=\"w\"> </span><span class=\"p\">{</span><span class=\"w\">      </span><span class=\"nl\">\"@timestamp\"</span><span class=\"p\">:</span><span class=\"w\"> </span><span class=\"p\">{</span><span class=\"w\"> </span><span class=\"nl\">\"type\"</span><span class=\"p\">:</span><span class=\"w\"> </span><span class=\"s2\">\"date\"</span><span class=\"w\"> </span><span class=\"p\">},</span><span class=\"w\">      </span><span class=\"nl\">\"dirname1\"</span><span class=\"p\">:</span><span class=\"w\"> </span><span class=\"p\">{</span><span class=\"w\"> </span><span class=\"nl\">\"type\"</span><span class=\"p\">:</span><span class=\"w\"> </span><span class=\"s2\">\"keyword\"</span><span class=\"w\"> </span><span class=\"p\">},</span><span class=\"w\">      </span><span class=\"nl\">\"logical_dirname\"</span><span class=\"p\">:</span><span class=\"w\"> </span><span class=\"p\">{</span><span class=\"w\"> </span><span class=\"nl\">\"type\"</span><span class=\"p\">:</span><span class=\"w\"> </span><span class=\"s2\">\"keyword\"</span><span class=\"w\"> </span><span class=\"p\">},</span><span class=\"w\">      </span><span class=\"nl\">\"filesize_sum\"</span><span class=\"p\">:</span><span class=\"w\"> </span><span class=\"p\">{</span><span class=\"w\"> </span><span class=\"nl\">\"type\"</span><span class=\"p\">:</span><span class=\"w\"> </span><span class=\"s2\">\"double\"</span><span class=\"w\"> 
</span><span class=\"p\">},</span><span class=\"w\">      </span><span class=\"nl\">\"filesize_avg\"</span><span class=\"p\">:</span><span class=\"w\"> </span><span class=\"p\">{</span><span class=\"w\"> </span><span class=\"nl\">\"type\"</span><span class=\"p\">:</span><span class=\"w\"> </span><span class=\"s2\">\"double\"</span><span class=\"w\"> </span><span class=\"p\">},</span><span class=\"w\">      </span><span class=\"nl\">\"filesize_count\"</span><span class=\"p\">:</span><span class=\"w\"> </span><span class=\"p\">{</span><span class=\"w\"> </span><span class=\"nl\">\"type\"</span><span class=\"p\">:</span><span class=\"w\"> </span><span class=\"s2\">\"long\"</span><span class=\"w\"> </span><span class=\"p\">},</span><span class=\"w\">      </span><span class=\"nl\">\"doc_count\"</span><span class=\"p\">:</span><span class=\"w\"> </span><span class=\"p\">{</span><span class=\"w\"> </span><span class=\"nl\">\"type\"</span><span class=\"p\">:</span><span class=\"w\"> </span><span class=\"s2\">\"long\"</span><span class=\"w\"> </span><span class=\"p\">}</span><span class=\"w\">    </span><span class=\"p\">}</span><span class=\"w\">  </span><span class=\"p\">}</span><span class=\"w\"></span><span class=\"p\">}</span><span class=\"w\"></span></code></pre></div></div><h3 id=\"3-add-intelligent-filtering-in-data_selection_query\">3. 
Add Intelligent Filtering in <code class=\"language-plaintext highlighter-rouge\">data_selection_query</code></h3><ul>  <li>Exclude <code class=\"language-plaintext highlighter-rouge\">_grokparsefailure</code> events.</li>  <li>Require existence of critical grouping fields.</li>  <li>Add script guards against empty or oversized keyword values.</li></ul><div class=\"language-python highlighter-rouge\"><div class=\"highlight\"><pre class=\"highlight\"><code><span class=\"s\">\"data_selection_query\"</span><span class=\"p\">:</span> <span class=\"p\">{</span>  <span class=\"s\">\"bool\"</span><span class=\"p\">:</span> <span class=\"p\">{</span>    <span class=\"s\">\"must\"</span><span class=\"p\">:</span> <span class=\"p\">[</span>      <span class=\"p\">{</span>        <span class=\"s\">\"range\"</span><span class=\"p\">:</span> <span class=\"p\">{</span>          <span class=\"s\">\"@timestamp\"</span><span class=\"p\">:</span> <span class=\"p\">{</span>            <span class=\"s\">\"gte\"</span><span class=\"p\">:</span> <span class=\"sa\">f</span><span class=\"s\">\"</span><span class=\"si\">{</span><span class=\"n\">year</span><span class=\"si\">}</span><span class=\"s\">-01-01T00:00:00Z\"</span><span class=\"p\">,</span>            <span class=\"s\">\"lt\"</span><span class=\"p\">:</span> <span class=\"sa\">f</span><span class=\"s\">\"</span><span class=\"si\">{</span><span class=\"n\">year</span> <span class=\"o\">+</span> <span class=\"mi\">1</span><span class=\"si\">}</span><span class=\"s\">-01-01T00:00:00Z\"</span>          <span class=\"p\">}</span>        <span class=\"p\">}</span>      <span class=\"p\">}</span>    <span class=\"p\">],</span>    <span class=\"s\">\"must_not\"</span><span class=\"p\">:</span> <span class=\"p\">[</span>      <span class=\"p\">{</span> <span class=\"s\">\"term\"</span><span class=\"p\">:</span> <span class=\"p\">{</span> <span class=\"s\">\"tags\"</span><span class=\"p\">:</span> <span 
class=\"s\">\"_grokparsefailure\"</span> <span class=\"p\">}</span> <span class=\"p\">}</span>    <span class=\"p\">],</span>    <span class=\"s\">\"filter\"</span><span class=\"p\">:</span> <span class=\"p\">[</span>      <span class=\"p\">{</span> <span class=\"s\">\"exists\"</span><span class=\"p\">:</span> <span class=\"p\">{</span> <span class=\"s\">\"field\"</span><span class=\"p\">:</span> <span class=\"s\">\"logical_dirname.keyword\"</span> <span class=\"p\">}</span> <span class=\"p\">},</span>      <span class=\"p\">{</span>        <span class=\"s\">\"script\"</span><span class=\"p\">:</span> <span class=\"p\">{</span>          <span class=\"s\">\"script\"</span><span class=\"p\">:</span> <span class=\"s\">\"doc['logical_dirname.keyword'].size() &gt; 0 &amp;&amp; doc['logical_dirname.keyword'].value.length() &lt; 1000\"</span>        <span class=\"p\">}</span>      <span class=\"p\">}</span>    <span class=\"p\">]</span>  <span class=\"p\">}</span><span class=\"p\">}</span></code></pre></div></div><h3 id=\"4-reduce-page_size\">4. Reduce <code class=\"language-plaintext highlighter-rouge\">page_size</code></h3><p>Lowering <code class=\"language-plaintext highlighter-rouge\">page_size</code> from <code class=\"language-plaintext highlighter-rouge\">1000</code> to <code class=\"language-plaintext highlighter-rouge\">50</code> significantly reduced memory pressure per composite aggregation page and helped avoid <code class=\"language-plaintext highlighter-rouge\">502 Bad Gateway</code> failures.</p><h3 id=\"5-restart-the-transform\">5. 
Restart the Transform</h3><p>After recreating the index and updating the transform definition, restart the job.</p><h2 id=\"conclusion\">Conclusion</h2><p>By combining explicit mappings, stronger filtering, smaller pagination, and a reset of stale transform state, the transform ran reliably and produced accurate summaries without repeated failure loops.</p><p>This debugging story reinforced a key lesson: robust pipelines are not just about handling valid data, but actively excluding invalid or malformed records before they poison downstream aggregation logic.</p>",
            "url": "https://hpc.social/personal-blog/2026/opensearch-transform-job-the-case-of-the-silent-failure-and-the-ghost-key/",
            
            
            
            
            
            "date_published": "2026-02-13T05:00:00-07:00",
            "date_modified": "2026-02-13T05:00:00-07:00",
            
                "author": "Derek Weitzel's Blog"
            
        },
    
        {
            "id": "https://hpc.social/personal-blog/2026/hpc-in-an-ai-world-swimming-upstream-with-more-conviction/",
            "title": "HPC in an AI world- swimming upstream with more conviction",
            "summary": null,
            "content_text": "Dan Reed recently published an essay, HPC In An AI World, that summarizes a longer-form statement piece he co-authored with Jack Dongarra and Dennis Gannon called Ride the Wave, Build the Future: Scientific Computing in an AI World. It's worth a read since, as with much of Dr. Reed's writing, it takes a necessary, hard look at where the HPC community needs to look as the world underneath it shifts as a result of the massive market forces driving AI.\n\nThis is a topic about which I've written at length in the past on my blog, and as I read Dr. Reed's latest post (and the Riding the Wave paper that motivated it), I found myself agreeing with many of his positions but disagreeing with some others.\n\nMy own background is in the world at the center of Dr. Reed's writing: traditional HPC for scientific computing at the national scale. However, my outlook has also been colored by the years I spent at Microsoft supporting massive-scale supercomputing infrastructure for training frontier models and the days I now spend at VAST, steeped in the wider enterprise AI market. This undoubtedly results in an unusual lens through which I now view Dr. Reed's position, and I couldn't help but mark up his essay with my own notes as I read through it.\n\nIn the event that my perspective--that of an HPC-turned-AI infrastructure practitioner--is of interest to anyone who found Dr. Reed's latest essay as engaging as I did, I've shared them below.\n\nNew Maxim Two: Energy and data movement, not floating point operations, are the scarce resources.\n\nThis has been true long before exascale in the HPC world. This is not a new maxim. Ironically, it is in the AI world that this maxim is relatively new; as inference overtakes training as the predominant consumer of GPU cycles, we are seeing widespread shortages of DRAM because of the extreme demand for HBM and the memory bandwidth it provides.\n\nNew Maxim Three: Benchmarks are mirrors, not levers. Benchmarks rarely drive technical change. 
Instead, they are snapshots of past and current reality, highlighting progress (or the lack thereof), but they have little power to influence strategic directions.\n\nBenchmarks drive technical change amongst technology providers who act without conviction. The tech industry is full of companies who are blindly chasing consumer demand, and these companies design entire product lines to achieve high benchmark results with the mistaken belief that those benchmarks are a reasonable proxy for actual productivity. And even worse, many buyers (especially in lower-sophistication markets like enterprise) also believe that benchmarks, by virtue of being designed by community organizations who have ostensibly thought deeply about performance, are a good proxy for productivity, and make purchasing decisions around these same benchmarks.\n\nThe net result is that a bad set of benchmarks can create and sustain an entire economy of buyers and sellers who think they are buying and selling something useful, when in fact they are wasting resources (time, energy, and COGS) because none of them actually understand what really drives productivity within their organizations.\n\nFortunately, the HPC community is generally savvier than enterprises, and most national computing centers now recognize that HPL is simply not a meaningful yardstick. While it used to be good for convincing politicians and other non-technical funders that good work was being done, the discourse around AI has squarely put Rmax in the ground as a meaningful metric. Politicians now understand \"hundreds of thousands of GPUs\" or \"gigawatts,\" neither of which requires a benchmark like HPL to prove.\n\nAlso, as an aside, I find it ironic that a paper with Jack Dongarra listed as an author is now saying HPL is a snapshot of the past. I've heard that he is the reason that HPL results achieved using emulated FP64 are not allowed on Top500. 
Despite achieving the required residuals through more innovative means than simply brute-forcing a problem through FP64 ALUs, techniques like the Ozaki scheme were deemed incompatible with the purpose of Top500. Which is to say, I think he's the reason why HPL and Top500 have been reduced to a benchmark that reflects outputs (hardware FP64 throughput) rather than outcomes (solving a system of equations using LU decomposition). New Maxim Four: Winning systems are co-designed end-to-end—workflow first, parts list second. … In HPC, we must pivot to funding sustained co-design ecosystems that bet on specific, high-impact scientific workflows I don't agree with this. Funding sustained co-design is just swimming upstream with more conviction. The real way forward is to find ways to align scientific discovery with the way the technology landscape is moving. This means truly riding the wave and accepting that scientific discovery may have to turn to completely different techniques that achieve their desired precision and validation through means that may render obsolete the skills and expertise some people have spent their careers developing. Consider the scaffolding of end-to-end workflow automation; a rich ecosystem of technologies exists in the enterprise and hyperscale worlds that have been used to build extreme-scale, globally distributed, resilient, observable, and high-performance workflows that combine ultra-scalable analytics engines with exascale data warehouses. However, realizing these capabilities in practice requires fundamentally rethinking the software infrastructure on which everything is built. The rigidities of Slurm and the inherent insecurities of relying on ACL- and kernel-based authentication and authorization need to be abandoned, or at least understood to be critically limiting factors that the HPC community chains itself to. To make this very specific, consider a bulk-synchronous MPI job running across a hundred thousand GPUs; if one node fails, the whole job fails. 
The \"swimming upstream with more conviction\" way of solving this problem is to pay a storage company to build a faster file system, pay some researchers to develop a domain-specific checkpoint library that glues the MPI application to platform-specific APIs, and pay SchedMD to automate fast restart based on these two enhancements. Fund all three projects under the same program, and it is arguably a \"co-designed end-to-end workflow.\" Riding the wave would be something different though: instead of requiring a job requeue and full restart from checkpoint upon job failure, treat the entire job as an end-to-end workflow. If a node fails, the job doesn't stop; it just transitions into a recovery state, where the orchestrator gives it a new node on which the job runtime can rebuild the state of the dead node using distributed parity or domain-specific knowledge. A fast file system is completely unnecessary for failure recovery. But the application developers would have to abandon the model of an application being a single process invocation in favor of the application being a system whose state evolves with the underlying hardware. Slurm can't do any of this, because Slurm is tied to the MPI model of parallel execution, which assumes nothing ever fails. Which is to say, I think co-design should be deferred until the HPC community recognizes that, so long as they continue to approach end-to-end co-design as an HPC problem to be solved by HPC people using HPC approaches, they will continue to swim upstream regardless of how much co-design they do. New Maxim Five: Research requires prototyping at scale (and risking failure), otherwise it is procurement. A variant of our 2023 maxim, prototyping – testing new and novel ideas – means accepting the risk of failure, otherwise it is simply incremental development. Implicit in the notion of prototyping is the need to test multiple ideas, then harvest the ones with promise. 
Remember, a prototype that cannot fail has another name – it’s called a product. The idea is right, but the title is wrong. Prototyping at scale is the wrong way to think about developing leadership supercomputing capability. The largest commercial AI infrastructure providers do not prototype at scale. Instead, they frame their thinking differently: anything done at scale is production, and if it doesn't work, make it work. In practice, this means forgoing years-long acceptance test processes and beating up suppliers over hundred-page-long statements of work. Instead, they accept the reality that they share the responsibility of integration with their suppliers, and if things go sideways, they are working with partners who will not walk away when times get tough. National-scale supercomputing has always been this way in practice, but the HPC community likes to pretend that it isn't. Consider Aurora: if that system wasn't a prototype-at-scale, I don't know what is. That system's deployment and operations were and remain fraught, and it is built on processors and nodes that were cancelled as products before the system even entered production. Yet the theatrics of acceptance testing went on, Intel got paid something, and we all pretend like Aurora is just like Frontier or Perlmutter. AI doesn’t prototype at scale; they just take a risk because the next breakthrough can't wait for every \"i\" to be dotted and \"t\" to be crossed. If a hyperscale AI system is a failure, that’s fine. The demand for FLOPS is sufficiently high that it will be utilized by someone for something, even if that use generates low-value results rather than the next frontier model that it was meant to build. The same is true for systems like Aurora; it's not like these systems sit idle, even if they don't live up to their original vision. And rest assured, AI systems prove to be bad ideas just like HPC systems do. 
The difference is scale: there are multi-billion-dollar AI supercomputers in existence that were obsolete before they even came online, because the problem they were designed to solve became irrelevant in the years it took to build them. But what was really lost? A bit of money and a little time. The GPUs are still used for day-to-day R&D or inferencing, and the time lost was made up for in lessons learned for the systems that followed. All the big AI systems are prototypes, because AI workloads themselves are continually evolving prototypes. As a result, the line between prototype and production becomes blurry, if not meaningless. All too often, in scientific computing, our gold is buried in disparate, multi-disciplinary datasets. This needs to change; we must build sustainable, multidisciplinary data fusion. This is so easy to say, but it always feels empty when it is said. What’s stopping this data fusion? I don’t think it’s willpower or resources. It’s just really difficult to figure out what good any of it would be within a standard theory-based modeling framework. Making productive use of fused multimodal data (meshes, particles, and discrete observations, for example) requires multimodal, multiphysics models. And such models are really expensive relative to the insights they deliver. To me, this means the challenge isn't in getting the world's scientific data to hold hands and sing kumbaya; it's accepting that there's limited value in actually doing this data fusion unless you're willing to also take on more approximations within the models that use them so that the net return--science per dollar--comes out as a net positive over today's physics-based, single-mode scientific models. The AI community accepts that wholly empirical models are much less interpretable but can much more readily turn multimodal data into results in a meaningfully faster, more resource-efficient way. 
Consider, for example, the Aurora model and how it took all sorts of disparate climate datasets to develop an incredibly efficient forecasting tool. In a minute on a single GPU, it produces forecasts of comparable quality to what would take hours across multiple GPUs using a physics-based model. And it achieves this efficiency by having trained on a diverse collection of gridded 3D atmosphere data and tabular data that was fused. The only problem, of course, is that the model is much less interpretable than a physics-based model. If the Aurora model's forecast is off, forecasters mostly have to shrug and move on with life. But for the purposes of solving the scientific problem at hand (predicting the weather a few days out), that may be good enough. Governments must now treat advanced computing as a strategic utility, requiring a scale of coordination and investment that rivals the Manhattan Project or the Apollo program. The Manhattan Project and the Apollo mission had distinct goals with a defined \"lump of work\" required to achieve them. They are not comparable to advanced computing. Computing is a commodity, and it’s a far fairer comparison to liken it to oil or gas reserves. And even then, exactly what good are these computing reserves or capabilities really? Is it one big supercomputer, or many small ones? What is the range of problems that such a strategic utility would be called upon to solve? In the AI game, advanced computing is certainly a pillar of competitiveness, but it is not necessarily the most limiting one. DeepSeek showed us that ingenuity and massive computing are two orthogonal axes towards developing new capabilities. They showed that, although you can spend a ton of money on GPUs to train a new frontier model, you can also be a lot more clever about how you use far fewer GPUs to do the same thing. 
And the ratio of people to capital that resulted in DeepSeek-R1 arguably showed that investing in innovation, not just datacenter buildout, has a much higher return on investment. In the context of the above statement, I think governments would do far better to treat their innovators as a strategic asset and worry less about issuing press releases that lead with how many thousands of GPUs they will deploy. For every thousand GPUs to be deployed on government land in the US this year, how many government researchers, architects, and visionaries have headed out the door and are never coming back?",
            "content_html": "<p>Dan Reed recently published an essay, <a href=\"https://hpcdan.org/2026/02/06/hpc-in-an-ai-world/\">HPC In An AIWorld</a>, that summarizes a longer-form statement piece he co-authoredwith Jack Dongarra and Dennis Gannon called <a href=\"https://hpcdan.org/wp-content/uploads/2026/01/Ride-The-Wave-Build-The-Future.pdf\">Ride the Wave, Build the Future: Scientific Computing in an AI World</a>. It's worth a read since, as withmuch of Dr. Reed's writing, it takes a necessary, hard look at wherethe HPC community needs to look as the world underneath it shifts as aresult of the massive market forces driving AI.</p><p>This is a topic about which I've written at length in the past on myblog, and as I read Dr. Reed's latest post (and the Riding the Wave paper thatmotivated it), I found myself agreeing with a many of his positions butdisagreeing with some others.</p><p>My own background is in the world at the center of Dr. Reed'swriting: traditional HPC for scientific computing at the national scale.However, my outlook has also been colored by the years I spent atMicrosoft supporting massive-scale supercomputing infrastructure fortraining frontier models and the days I now spend at VAST, steeped inthe wider enterprise AI market. This undoubtedly results in an unusual lens through which I now view Dr. Reed's position, and I couldn'thelp but mark up his essay with my own notes as I read through it.</p><p>In the event that my perspective--that of an HPC-turned-AIinfrastructure practitioner--is of interest to anyone who found Dr.Reed's latest essay as engaging as I did, I've shared them below.</p><div class=\"separator\" style=\"clear: both; display: none; text-align: center;\"></div><blockquote><p><b>New Maxim Two: Energy and data movement, not floating pointoperations, are the scarce resources.</b></p></blockquote><p>This has been true long before exascale in the HPC world. This is nota new maxim. 
Ironically, it is in the AI world that this maxim is relatively new; as inference overtakes training as the predominant consumer of GPU cycles, we are seeing widespread shortages of DRAM because of the extreme demand for HBM and the memory bandwidth it provides.</p><blockquote><p><b>New Maxim Three: Benchmarks are mirrors, not levers. Benchmarks rarely drive technical change. Instead, they are snapshots of past and current reality, highlighting progress (or the lack thereof), but they have little power to influence strategic directions.</b></p></blockquote><p>Benchmarks drive technical change amongst technology providers who act without conviction. The tech industry is full of companies who are blindly chasing consumer demand, and these companies design entire product lines to achieve high benchmark results with the mistaken belief that those benchmarks are a reasonable proxy for actual productivity. And even worse, many buyers (especially in lower-sophistication markets like enterprise) also believe that benchmarks, by virtue of being designed by community organizations who have ostensibly thought deeply about performance, are a good proxy for productivity, and make purchasing decisions around these same benchmarks.</p><p>The net result is that a bad set of benchmarks can create and sustain an entire economy of buyers and sellers who think they are buying and selling something useful, when in fact they are wasting resources (time, energy, and COGS) because none of them actually understand what really drives productivity within their organizations.</p><p>Fortunately, the HPC community is generally savvier than enterprises, and most national computing centers now recognize that HPL is simply not a meaningful yardstick. While it used to be good for convincing politicians and other non-technical funders that good work was being done, the discourse around AI has squarely put R<sub>max</sub> in the ground as a meaningful metric. 
Politicians now understand \"hundreds of thousands of GPUs\" or \"gigawatts,\" neither of which requires a benchmark like HPL to prove.</p><p>Also, as an aside, I find it ironic that a paper with Jack Dongarra listed as an author is now saying HPL is a snapshot of the past. I've heard that he is the reason that HPL results achieved using emulated FP64 are not allowed on Top500. Despite achieving the required residuals through more innovative means than simply brute-forcing a problem through FP64 ALUs, techniques like the Ozaki scheme were deemed incompatible with the purpose of Top500. Which is to say, I think he's the reason why HPL and Top500 have been reduced to a benchmark that reflects outputs (hardware FP64 throughput) rather than outcomes (solving a system of equations using LU decomposition).</p><blockquote><p><b>New Maxim Four: Winning systems are co-designed end-to-end—workflow first, parts list second.</b></p><p><b>…</b></p><p><b>In HPC, we must pivot to funding sustained co-design ecosystems that bet on specific, high-impact scientific workflows</b></p></blockquote><p>I don't agree with this. Funding sustained co-design is just swimming upstream with more conviction.</p><p>The real way forward is to find ways to align scientific discovery with the way the technology landscape is moving. This means truly riding the wave and accepting that scientific discovery may have to turn to completely different techniques that achieve their desired precision and validation through means that may render obsolete the skills and expertise some people have spent their careers developing.</p><p>Consider the scaffolding of end-to-end workflow automation; a rich ecosystem of technologies exists in the enterprise and hyperscale worlds that have been used to build extreme-scale, globally distributed, resilient, observable, and high-performance workflows that combine ultra-scalable analytics engines with exascale data warehouses. 
However, realizing these capabilities in practice requires fundamentally rethinking the software infrastructure on which everything is built. The rigidities of Slurm and the inherent insecurities of relying on ACL- and kernel-based authentication and authorization need to be abandoned, or at least understood to be critically limiting factors that the HPC community chains itself to.</p><p>To make this very specific, consider a bulk-synchronous MPI job running across a hundred thousand GPUs; if one node fails, the whole job fails. The \"swimming upstream with more conviction\" way of solving this problem is to pay a storage company to build a faster file system, pay some researchers to develop a domain-specific checkpoint library that glues the MPI application to platform-specific APIs, and pay SchedMD to automate fast restart based on these two enhancements. Fund all three projects under the same program, and it is arguably a \"co-designed end-to-end workflow.\"</p><p>Riding the wave would be something different though: instead of requiring a job requeue and full restart from checkpoint upon job failure, treat the entire job as an end-to-end workflow. If a node fails, the job doesn't stop; it just transitions into a recovery state, where the orchestrator gives it a new node on which the job runtime can rebuild the state of the dead node using distributed parity or domain-specific knowledge. A fast file system is completely unnecessary for failure recovery. But the application developers would have to abandon the model of an application being a single process invocation in favor of the application being a system whose state evolves with the underlying hardware.</p><p>Slurm can't do any of this, because Slurm is tied to the MPI model of parallel execution, which assumes nothing ever fails. 
Which is to say, I think co-design should be deferred until the HPC community recognizes that, so long as they continue to approach end-to-end co-design as an HPC problem to be solved by HPC people using HPC approaches, they will continue to swim upstream regardless of how much co-design they do.</p><blockquote><p><b>New Maxim Five: Research requires prototyping at scale (and risking failure), otherwise it is procurement. A variant of our 2023 maxim, prototyping – testing new and novel ideas – means accepting the risk of failure, otherwise it is simply incremental development. Implicit in the notion of prototyping is the need to test multiple ideas, then harvest the ones with promise. Remember, a prototype that cannot fail has another name – it’s called a product.</b></p></blockquote><p>The idea is right, but the title is wrong. Prototyping at scale is the wrong way to think about developing leadership supercomputing capability. The largest commercial AI infrastructure providers do not prototype at scale. Instead, they frame their thinking differently: anything done at scale is production, and if it doesn't work, make it work.</p><p>In practice, this means forgoing <a href=\"https://cdn.lanl.gov/files/ats-5-rfp-sept2024_d80e2.pdf#page=55\">years-long acceptance test processes</a> and beating up suppliers over hundred-page-long statements of work. Instead, they accept the reality that they share the responsibility of integration with their suppliers, and if things go sideways, they are working with partners who will not walk away when times get tough.</p><p>National-scale supercomputing has always been this way in practice, but the HPC community likes to pretend that it isn't. Consider Aurora: if that system wasn't a prototype-at-scale, I don't know what is. 
That system's <a href=\"https://www.tomshardware.com/news/us-governments-aurora-supercomputer-delayed-due-to-intels-7nm-setback\">deployment and operations were and remain fraught</a>, and it is built on processors and nodes that <a href=\"https://www.servethehome.com/intel-ponte-vecchio-spaceship-gpu-no-longer-hunting-new-clusters/\">were cancelled as products</a> <a href=\"https://www.alcf.anl.gov/news/argonne-releases-aurora-exascale-supercomputer-researchers\">before the system even entered production</a>. Yet the theatrics of acceptance testing went on, Intel got paid something, and we all pretend like Aurora is just like Frontier or Perlmutter.</p><p>AI doesn’t prototype at scale; they just take a risk because the next breakthrough can't wait for every \"i\" to be dotted and \"t\" to be crossed. If a hyperscale AI system is a failure, that’s fine. The demand for FLOPS is sufficiently high that it will be utilized by someone for something, even if that use generates low-value results rather than the next frontier model that it was meant to build. The same is true for systems like Aurora; it's not like these systems sit idle, even if they don't live up to their original vision.</p><p>And rest assured, AI systems prove to be bad ideas just like HPC systems do. The difference is scale: there are multi-billion-dollar AI supercomputers in existence that were obsolete before they even came online, because the problem they were designed to solve became irrelevant in the years it took to build them. But what was really lost? A bit of money and a little time. The GPUs are still used for day-to-day R&amp;D or inferencing, and the time lost was made up for in lessons learned for the systems that followed.</p><p>All the big AI systems are prototypes, because AI workloads themselves are continually evolving prototypes. 
As a result, the line between prototype and production becomes blurry, if not meaningless.</p><blockquote><p><b>All too often, in scientific computing, our gold is buried in disparate, multi-disciplinary datasets. This needs to change; we must build sustainable, multidisciplinary data fusion.</b></p></blockquote><p>This is so easy to say, but it always feels empty when it is said. What’s stopping this data fusion? I don’t think it’s willpower or resources. It’s just really difficult to figure out what good any of it would be within a standard theory-based modeling framework. Making productive use of fused multimodal data (meshes, particles, and discrete observations, for example) requires multimodal, multiphysics models. And such models are really expensive relative to the insights they deliver.</p><p>To me, this means the challenge isn't in getting the world's scientific data to hold hands and sing kumbaya; it's accepting that there's limited value in actually doing this data fusion unless you're willing to also take on more approximations within the models that use them so that the net return--science per dollar--comes out as a net positive over today's physics-based, single-mode scientific models.</p><p>The AI community accepts that wholly empirical models are much less interpretable but can much more readily turn multimodal data into results in a meaningfully faster, more resource-efficient way. Consider, for example, the <a href=\"https://www.microsoft.com/en-us/research/project/aurora-forecasting/\">Aurora model</a> and how it took <a href=\"https://arxiv.org/html/2405.13063v2\">all sorts of disparate climate datasets</a> to develop an incredibly efficient forecasting tool. 
In a minute on a single GPU, it produces forecasts of comparable quality to what would take hours across multiple GPUs using a physics-based model. And it achieves this efficiency by having trained on a diverse collection of gridded 3D atmosphere data and tabular data that was fused.</p><p>The only problem, of course, is that the model is much less interpretable than a physics-based model. If the Aurora model's forecast is off, forecasters mostly have to shrug and move on with life. But for the purposes of solving the scientific problem at hand (predicting the weather a few days out), that may be good enough.</p><blockquote><p><b>Governments must now treat advanced computing as a strategic utility, requiring a scale of coordination and investment that rivals the <a href=\"https://en.wikipedia.org/wiki/Manhattan_Project\">Manhattan Project</a> or the <a href=\"https://en.wikipedia.org/wiki/Apollo_program\">Apollo program</a>.</b></p></blockquote><p>The Manhattan Project and the Apollo mission had distinct goals with a defined \"lump of work\" required to achieve them. They are not comparable to advanced computing. Computing is a commodity, and it’s a far fairer comparison to liken it to oil or gas reserves. And even then, exactly what good are these computing reserves or capabilities really? Is it one big supercomputer, or many small ones? What is the range of problems that such a strategic utility would be called upon to solve?</p><p>In the AI game, advanced computing is certainly a pillar of competitiveness, but it is not necessarily the most limiting one. DeepSeek showed us that ingenuity and massive computing are two orthogonal axes towards developing new capabilities. They showed that, although you can spend a ton of money on GPUs to train a new frontier model, you can also be a lot more clever about how you use far fewer GPUs to do the same thing. 
And the ratio of people to capital that resulted in DeepSeek-R1 arguably showed that investing in innovation, not just datacenter buildout, has a much higher return on investment.</p><p>In the context of the above statement, I think governments would do far better to treat their innovators as a strategic asset and worry less about issuing press releases that lead with how many thousands of GPUs they will deploy. For every thousand GPUs to be deployed on government land in the US this year, how many government researchers, architects, and visionaries have headed out the door and are never coming back?</p>",
            "url": "https://hpc.social/personal-blog/2026/hpc-in-an-ai-world-swimming-upstream-with-more-conviction/",
            
            
            
            
            
            "date_published": "2026-02-07T22:11:00-07:00",
            "date_modified": "2026-02-07T22:11:00-07:00",
            
                "author": "Glenn K. Lockwood's Blog"
            
        },
    
        {
            "id": "https://hpc.social/personal-blog/2026/who-needs-full-featured-ci-and-why/",
            "title": "Who needs full-featured CI and why",
            "summary": null,
            "content_text": "Ian Duncan has written a great post on CI orchestration called No, Really, Bash Is Not Enough: Why Large-Scale CI Needs an Orchestrator. It does a good job of distinguishing between the simple cases where bash and make really are good enough for CI, and when you actually need a full-featured CI system.I am talking to teams where CI is a load-bearing piece of infrastructure. Teams where 20 or 50 or 200 engineers push code daily. Teams where a broken CI pipeline doesn’t mean one person waits a few extra minutes; it means a queue of pull requests backs up, a deploy window gets missed, and product timelines slip. Teams where CI time is measured in engineering-hours-lost-per-week and has a line item on somebody’s OKRs.It also leans heavily on one of my favorite papers, “Build systems à la carte” by Mokhov et al. From the discussion of that paper:The real takeaway is not that bash is bad. It’s that the design space of build systems has&nbsp;structure, and that structure has been studied, and that the properties you care about (minimality, correctness, support for dynamic dependencies, cloud caching, early cutoff) correspond to specific architectural choices that live at a level of abstraction bash cannot express. When you write a build pipeline in bash, you are either implementing one of the twelve cells in the Mokhov-Mitchell-Jones matrix (poorly, by hand, with strings and exit codes), or you are living in the&nbsp;busy&nbsp;cell and rebuilding everything every time.It’s a long read but a good one, go check it out.",
            "content_html": "<p>Ian Duncan has written a great post on CI orchestration called <em><a href=\"https://www.iankduncan.com/engineering/2026-02-06-bash-is-not-enough/\">No, Really, Bash Is Not Enough: Why Large-Scale CI Needs an Orchestrator</a></em>. It does a good job of distinguishing between the simple cases where bash and make really are good enough for CI, and when you actually need a full-featured CI system.</p><p><span id=\"more-456\"></span></p><blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\"><p>I am talking to teams where CI is a load-bearing piece of infrastructure. Teams where 20 or 50 or 200 engineers push code daily. Teams where a broken CI pipeline doesn’t mean one person waits a few extra minutes; it means a queue of pull requests backs up, a deploy window gets missed, and product timelines slip. Teams where CI time is measured in engineering-hours-lost-per-week and has a line item on somebody’s OKRs.</p></blockquote><p>It also leans heavily on one of my favorite papers, “<a href=\"https://dl.acm.org/doi/10.1145/3236774\">Build systems à la carte</a>” by Mokhov <em>et al</em>. From the discussion of that paper:</p><blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\"><p>The real takeaway is not that bash is bad. It’s that the design space of build systems has&nbsp;<em>structure</em>, and that structure has been studied, and that the properties you care about (minimality, correctness, support for dynamic dependencies, cloud caching, early cutoff) correspond to specific architectural choices that live at a level of abstraction bash cannot express. 
When you write a build pipeline in bash, you are either implementing one of the twelve cells in the Mokhov-Mitchell-Jones matrix (poorly, by hand, with strings and exit codes), or you are living in the&nbsp;<code>busy</code>&nbsp;cell and rebuilding everything every time.</p></blockquote><p>It’s a long read but a good one, go check it out.</p>",
            "url": "https://hpc.social/personal-blog/2026/who-needs-full-featured-ci-and-why/",
            
            
            
            
            
            "date_published": "2026-02-07T00:38:16-07:00",
            "date_modified": "2026-02-07T00:38:16-07:00",
            
                "author": "Thinking Out Loud"
            
        },
    
        {
            "id": "https://hpc.social/personal-blog/2026/quoting-charity-majors/",
            "title": "Quoting Charity Majors",
            "summary": null,
            "content_text": "Charity’s latest post, Bring back ops pride, is an excellent discussion (rant?) on the importance of operations for software systems and why it’s a bad idea to try and pretend it isn’t a real concern, or make conventional application teams do the work in addition to their regular job.“Operations” is not a dirty word, a synonym for toil, or a title for people who can’t write code. May those who shit on ops get the operational outcomes they deserve.You should absolutely go read the full piece, as well as Charity’s earlier post on the Honeycomb blog: You had one job: Why twenty years of DevOps has failed to do it. Below find several pull quotes from the post itself, because there were just too many to choose from.The difference between “dev” and “ops” is not about whether or not you can write code. Dude, it’s 2026:&nbsp;everyone writes software.The difference between dev and ops is a separation of concerns.The hardest technical challenges and the long, stubborn tail of intractable problems have&nbsp;always&nbsp;been on the infrastructure side.&nbsp;That’s why we work&nbsp;so hard&nbsp;to try not to have them—to solve them by partnerships, cloud computing, open source, etc.&nbsp;Anything&nbsp;is better than trying to build them again, starting over from scratch. We know the cost of new code in our bones.As I have said a thousand times: the closer you get to laying bits down on disk, the more conservative (and afraid) you should be.The difference between dev and ops isn’t about writing code or not. But there&nbsp;are&nbsp;differences. In perspective, priorities, and (often) temperament.I touched on a number of these in&nbsp;the article I just wrote on feedback loops, so I’m not going to repeat myself here.The biggest difference I did&nbsp;not&nbsp;mention is that they have different relationships with resources and definitions of success.Infrastructure is a cost center. 
You aren’t going to make more money if you give ten laptops to everyone in your company, and you aren’t going to make more money by over-spending on infrastructure, either. Great operations engineers and architects never forget that&nbsp;cost is a first class citizen&nbsp;of their engineering decisions.Operational rigor and excellence are not, how shall I say this…not yet something you can take for granted in the tech industry. The most striking thing about the 2025 DORA report was that the&nbsp;majority of companies&nbsp;report that AI is just adding more chaos to a system already defined by chaos. In other words, most companies are bad at ops.",
            "content_html": "<p>Charity’s latest post, <em><a href=\"https://charity.wtf/2026/01/19/bring-back-ops-pride-xpost/\">Bring back ops pride</a></em>, is an excellent discussion (rant?) on the importance of operations for software systems and why it’s a bad idea to try and pretend it isn’t a real concern, or make conventional application teams do the work in addition to their regular job.</p><blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\"><p>“Operations” is not a dirty word, a synonym for toil, or a title for people who can’t write code. May those who shit on ops get the operational outcomes they deserve.</p></blockquote><p>You should absolutely go read the <a href=\"https://charity.wtf/2026/01/19/bring-back-ops-pride-xpost/\">full piece</a>, as well as Charity’s earlier post on the Honeycomb blog: <em><a href=\"https://www.honeycomb.io/blog/you-had-one-job-why-twenty-years-of-devops-has-failed-to-do-it\">You had one job: Why twenty years of DevOps has failed to do it</a></em>. </p><p>Below find several pull quotes from the post itself, because there were just too many to choose from.</p><p><span id=\"more-430\"></span></p><blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\"><p>The difference between “dev” and “ops” is not about whether or not you can write code. Dude, it’s 2026:&nbsp;<strong>everyone writes software</strong>.</p><p>The difference between dev and ops is a separation of concerns.</p></blockquote><blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\"><p>The hardest technical challenges and the long, stubborn tail of intractable problems have&nbsp;<em>always</em>&nbsp;been on the infrastructure side.&nbsp;<strong>That’s why we work&nbsp;<em>so hard</em>&nbsp;to try not to have them</strong>—to solve them by partnerships, cloud computing, open source, etc.&nbsp;<em>Anything</em>&nbsp;is better than trying to build them again, starting over from scratch. 
We know the cost of new code in our bones.</p><p>As I have said a thousand times: the closer you get to laying bits down on disk, the more conservative (and afraid) you should be.</p></blockquote><blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\"><p>The difference between dev and ops isn’t about writing code or not. But there&nbsp;<em>are</em>&nbsp;differences. In perspective, priorities, and (often) temperament.</p><p>I touched on a number of these in&nbsp;<a href=\"https://www.honeycomb.io/blog/you-had-one-job-why-twenty-years-of-devops-has-failed-to-do-it\">the article I just wrote on feedback loops</a>, so I’m not going to repeat myself here.</p><p>The biggest difference I did&nbsp;<em>not</em>&nbsp;mention is that they have different relationships with resources and definitions of success.</p><p>Infrastructure is a cost center. You aren’t going to make more money if you give ten laptops to everyone in your company, and you aren’t going to make more money by over-spending on infrastructure, either. Great operations engineers and architects never forget that&nbsp;<strong>cost is a first class citizen</strong>&nbsp;of their engineering decisions.</p></blockquote><blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\"><p>Operational rigor and excellence are not, how shall I say this…not yet something you can take for granted in the tech industry. The most striking thing about the 2025 DORA report was that the&nbsp;<em>majority of companies</em>&nbsp;report that AI is just adding more chaos to a system already defined by chaos. In other words, most companies are bad at ops.</p></blockquote>",
            "url": "https://hpc.social/personal-blog/2026/quoting-charity-majors/",
            
            
            
            
            
            "date_published": "2026-01-19T17:47:51-07:00",
            "date_modified": "2026-01-19T17:47:51-07:00",
            
                "author": "Thinking Out Loud"
            
        },
    
        {
            "id": "https://hpc.social/personal-blog/2026/quoting-nicholas-carlini/",
            "title": "Quoting Nicholas Carlini",
            "summary": null,
            "content_text": "Because when the people training these models justify why they’re worth it, they appeal to pretty extreme outcomes. When Dario Amodei wrote his essay Machines of Loving Grace, he wrote that he sees the benefits as being extraordinary: “Reliable prevention and treatment of nearly all natural infectious disease … Elimination of most cancer … Prevention of Alzheimer’s … Improved treatment of most other ailments … Doubling of the human lifespan.” These are the benefits that the CEO of Anthropic uses to justify his belief that LLMs are worth it. If you think that these risks sound fanciful, then I might encourage you to consider what benefits you see LLMs as bringing, and then consider if you think the risks are worth it.From Carlini’s recent talk/article on Are large language models worth it?The entire article is well worth reading, but I was struck by this bit near the end. LLM researchers often dismiss (some of) the risks of these models as fanciful. But many of the benefits touted by the labs sound just as fanciful!When we’re evaluating the worth of this research, it’s a good idea to be consistent about how realistic — or how “galaxy brain” — you want to be, with both risks and benefits.",
            "content_html": "<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\"><p>Because when the people training these models justify why they&#8217;re worth it, they appeal to pretty extreme outcomes. When Dario Amodei wrote his essay&nbsp;<a href=\"https://www.darioamodei.com/essay/machines-of-loving-grace\">Machines of Loving Grace</a>, he wrote that he sees the benefits as being extraordinary: &#8220;Reliable prevention and treatment of nearly all natural infectious disease &#8230; Elimination of most cancer &#8230; Prevention of Alzheimer’s &#8230; Improved treatment of most other ailments &#8230; Doubling of the human lifespan.&#8221; These are the benefits that the CEO of Anthropic uses to justify his belief that LLMs are worth it. If you think that these risks sound fanciful, then I might encourage you to consider what benefits you see LLMs as bringing, and then consider if you think the risks&nbsp;are worth it.</p></blockquote><p>From Carlini’s recent talk/article on <em><a href=\"https://nicholas.carlini.com/writing/2025/are-llms-worth-it.html\">Are large language models worth it?</a></em></p><p>The entire article is well worth reading, but I was struck by this bit near the end. LLM researchers often dismiss (some of) the risks of these models as fanciful. But many of the benefits touted by the labs sound just as fanciful!</p><p>When we’re evaluating the worth of this research, it’s a good idea to be consistent about how realistic — or how “galaxy brain” — you want to be, with both risks and benefits.</p>",
            "url": "https://hpc.social/personal-blog/2026/quoting-nicholas-carlini/",
            
            
            
            
            
            "date_published": "2026-01-18T17:07:12-07:00",
            "date_modified": "2026-01-18T17:07:12-07:00",
            
                "author": "Thinking Out Loud"
            
        },
    
        {
            "id": "https://hpc.social/personal-blog/2026/robin-sloan-agi-is-already-here/",
            "title": "Robin Sloan- AGI is already here!",
            "summary": null,
            "content_text": "In Robin Sloan’s “pop-up newsletter” Winter Garden, he argues that artificial general intelligence has been with us since the development of GPT-3:The trick is to read plainly.The key word in Artificial General Intelligence is General. That’s the word that makes this AI unlike every other AI: because every other AI was trained for a particular purpose and, &amp; even if it achieved it in spectacular fashion, did not do anything else. Consider landmark models across the decades: the Mark I&nbsp;Perceptron, LeNet, AlexNet, AlphaGo, AlphaFold … these systems were all different, but all alike in this way.Language models were trained for a purpose, too … but, surprise: the mechanism &amp; scale of that training did something new: opened a wormhole, through which a vast field of action &amp; response could be reached. Towering libraries of human writing, drawn together across time &amp; space, all the dumb reasons for it … that’s rich fuel, if you can hold it all in your head.It’s important to emphasize that the open-ended capability of these big models was a genuine surprise, even to their custodians. Once understood, the opportunity was quickly grasped … but the magnitude of that initial whoa?! is still ringing the bell of this century.I’m extreme in this regard: I&nbsp;think 2020’s Language Models are Few-Shot Learners marks the AGI moment. In that paper, OpenAI researchers demonstrated that GPT-3 — at that time, the biggest model of its kind ever trained — performed better on a wide range of linguistic tasks than models trained for those tasks specifically. A more direct title might have been: This Thing Can Do It All?!“AGI” is such a misused, ill-defined term that I honestly don’t find it too useful… but it’s hard to argue with Sloan’s argument here! 
Certainly if you showed current LLMs to someone from 20 years ago, or even 10, they’d seem like wild science fiction.It also reminds me of a quote from Asimov on the definition of “artificial intelligence” and how the goal posts move as new achievements are retrospectively deemed as “not AI”:[artificial intelligence is] a phrase that we use for any device that does things which, in the past, we have associated only with human intelligence(via Nicholas Carlini)So. Do we have AGI? Do we even meaningfully have AI? What would we have to see for the general consensus to agree they had been achieved?Anyway, they are mostly marketing terms at this point. But it can still be interesting to think about them.Thoughts from a dog walk listening to the Sloan article using ElevenReader.Benny is unimpressed with being asked to pose during his walk",
            "content_html": "<p>In Robin Sloan’s “pop-up newsletter” <em>Winter Garden</em>, <a href=\"https://www.robinsloan.com/winter-garden/agi-is-here/\">he argues that artificial general intelligence has been with us since the development of GPT-3</a>:</p><p><span id=\"more-419\"></span></p><blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\"><p>The trick is to read plainly.</p><p>The key word in Artificial General Intelligence is General. That’s the word that makes this AI unlike every other AI: because every other AI was trained for a particular purpose and, &amp; even if it achieved it in spectacular fashion, did not do anything else. Consider landmark models across the decades: the Mark I&nbsp;Perceptron, LeNet, AlexNet, AlphaGo, AlphaFold … these systems were all different, but all alike in this way.</p><p>Language models were trained for a purpose, too … but, surprise: the mechanism &amp; scale of that training did something new: opened a wormhole, through which a vast field of action &amp; response could be reached. Towering libraries of human writing, drawn together across time &amp; space, all the dumb reasons for it … that’s rich fuel, if you can hold it all in your head.</p><p>It’s important to emphasize that the open-ended capability of these big models was a genuine surprise, even to their custodians. Once understood, the opportunity was quickly grasped … but the magnitude of that initial whoa?! is still ringing the bell of this century.</p><p>I’m extreme in this regard: I&nbsp;think 2020’s <a href=\"https://arxiv.org/abs/2005.14165?utm_source=Robin_Sloan_sent_me\">Language Models are Few-Shot Learners</a> marks the AGI moment. In that paper, OpenAI researchers demonstrated that GPT-3 — at that time, the biggest model of its kind ever trained — performed better on a wide range of linguistic tasks than models trained for those tasks specifically. 
A more direct title might have been: This Thing Can Do It All?!</p></blockquote><p>“AGI” is such a misused, ill-defined term that I honestly don’t find it too useful… but it’s hard to argue with Sloan’s argument here! Certainly if you showed current LLMs to someone from 20 years ago, or even 10, they’d seem like wild science fiction.</p><p>It also reminds me of a quote from Asimov on the definition of “artificial intelligence” and how the goal posts move as new achievements are retrospectively deemed as “not AI”:</p><blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\"><p>[artificial intelligence is] a phrase that we use for any device that does things which, in the past, we have associated only with human intelligence</p></blockquote><p>(via <a href=\"https://nicholas.carlini.com/writing/2025/are-llms-worth-it.html\">Nicholas Carlini</a>)</p><p>So. Do we have AGI? Do we even meaningfully have AI? What would we have to see for the general consensus to agree they had been achieved?</p><p>Anyway, they are mostly marketing terms at this point. But it can still be interesting to think about them.</p><hr class=\"wp-block-separator has-alpha-channel-opacity\" /><p>Thoughts from a dog walk listening to the Sloan article using ElevenReader.</p><figure class=\"wp-block-image size-large\"><img class=\"wp-image-421\" height=\"768\" src=\"https://thinking.ajdecon.org/wp-content/uploads/2026/01/img_3019-1024x768.jpg\" width=\"1024\" /><figcaption class=\"wp-element-caption\">Benny is unimpressed with being asked to pose during his walk</figcaption></figure>",
            "url": "https://hpc.social/personal-blog/2026/robin-sloan-agi-is-already-here/",
            
            
            
            
            
            "date_published": "2026-01-18T16:50:37-07:00",
            "date_modified": "2026-01-18T16:50:37-07:00",
            
                "author": "Thinking Out Loud"
            
        },
    
        {
            "id": "https://hpc.social/personal-blog/2026/tailscale/",
            "title": "tailscale",
            "summary": null,
            "content_text": "Some discussion on bsky of the usefulness of Tailscale, and I’ll just note here how very handy it is for running a personal homelab that includes cloud instances. As well as just having lab connectivity from a laptop or phone on the go!Services I run over Tailscale, just for myself, include:An RSS feed readerA personal git forgeAn IRC bouncerA (poorly maintained) wikiJupyterLabOpen WebUI for playing with local LLMs on a GPU workstationSSH to a powerful workstation, hosted at home but without complex configsAnd probably a few things I’ve forgotten! It’s really just very neat. Sure I could do it all with manual Wireguard configs. But Tailscale just makes the underlying primitive much more ergonomic.",
            "content_html": "<p><a href=\"https://bsky.app/profile/buttplug.engineer/post/3mc6qyarp2c2m\">Some discussion on bsky</a> of the usefulness of Tailscale, and I’ll just note here how very handy it is for running a personal homelab that includes cloud instances. As well as just having lab connectivity from a laptop or phone on the go!</p><p>Services I run over Tailscale, just for myself, include:</p><ul class=\"wp-block-list\"><li>An RSS feed reader</li><li>A personal git forge</li><li>An IRC bouncer</li><li>A (poorly maintained) wiki</li><li>JupyterLab</li><li>Open WebUI for playing with local LLMs on a GPU workstation</li><li>SSH to a powerful workstation, hosted at home but without complex configs</li></ul><p>And probably a few things I’ve forgotten! It’s really just very neat. Sure I could do it all with manual Wireguard configs. But Tailscale just makes the underlying primitive much more ergonomic.</p>",
            "url": "https://hpc.social/personal-blog/2026/tailscale/",
            
            
            
            
            
            "date_published": "2026-01-12T05:13:12-07:00",
            "date_modified": "2026-01-12T05:13:12-07:00",
            
                "author": "Thinking Out Loud"
            
        },
    
        {
            "id": "https://hpc.social/personal-blog/2026/quoting-antirez-on-ai/",
            "title": "Quoting antirez on AI",
            "summary": null,
            "content_text": "Anyway, back to programming. I have a single suggestion for you, my friend. Whatever you believe about what the Right Thing should be, you can't control it by refusing what is happening right now. Skipping AI is not going to help you or your career. Think about it. Test these new tools, with care, with weeks of work, not in a five minutes test where you can just reinforce your own beliefs. Find a way to multiply yourself, and if it does not work for you, try again every few months.Yes, maybe you think that you worked so hard to learn coding, and now machines are doing it for you. But what was the fire inside you, when you coded till night to see your project working? It was building. And now you can build more and better, if you find your way to use AI effectively. The fun is still there, untouchedFrom Don’t fall into the anti-AI hype",
            "content_html": "<blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\"><pre class=\"wp-block-preformatted\">Anyway, back to programming. I have a single suggestion for you, my friend. Whatever you believe about what the Right Thing should be, you can't control it by refusing what is happening right now. Skipping AI is not going to help you or your career. Think about it. Test these new tools, with care, with weeks of work, not in a five minutes test where you can just reinforce your own beliefs. Find a way to multiply yourself, and if it does not work for you, try again every few months.<br /><br />Yes, maybe you think that you worked so hard to learn coding, and now machines are doing it for you. But what was the fire inside you, when you coded till night to see your project working? It was building. And now you can build more and better, if you find your way to use AI effectively. The fun is still there, untouched</pre></blockquote><p>From <em><a href=\"https://antirez.com/news/158\">Don’t fall into the anti-AI hype</a></em></p>",
            "url": "https://hpc.social/personal-blog/2026/quoting-antirez-on-ai/",
            
            
            
            
            
            "date_published": "2026-01-12T03:44:08-07:00",
            "date_modified": "2026-01-12T03:44:08-07:00",
            
                "author": "Thinking Out Loud"
            
        },
    
        {
            "id": "https://hpc.social/personal-blog/2026/latency-critical-linux-task-scheduling-for-gaming/",
            "title": "Latency-critical Linux task scheduling for gaming",
            "summary": null,
            "content_text": "LWN has an excellent article up on the “latency-criticality aware virtual deadline” (LAVD) scheduler, from a talk at the Linux Plumbers Conference in December.In particular, I appreciate the detailed discussion of using different profilers and performance-analysis tools at different levels to determine how to optimize scheduling to improve two key goals: providing high average FPS while keeping 99th-percentile FPS as low as possible, e.g. to prevent UI stuttering. Optimizing for battery usage is also important, as the Steam Deck was one of the main targets for this work.The key finding that came out of his analysis is perhaps somewhat obvious: a single high-level action, such as moving a character on-screen and emitting a sound based on a key-press event, requires that many tasks work together. Some of the tasks are threads in the game process, but others are not because they are in the game engine, kernel, and device drivers; there are often 20 or 30 tasks in a chain that all need to collaborate. Finding tasks with a high waker or wakee frequency and prioritizing them is the basis of the LAVD scheduling policy.As always with LWN there’s good coverage not only of the talk itself, but also the Q&amp;A following the session and ideas from the audience on tooling and other improvements.Phoronix also covered a different talk from the same conference (I think) on how Meta is using the LAVD scheduler as the basis for a new default scheduler used on their fleet. I haven’t had a chance to watch this talk yet (video linked from the article) but I’m very interested in the idea that the same concepts might be useful to a hyper scaler as well as a device like a Steam Deck.",
            "content_html": "<p><em><a href=\"https://lwn.net/Articles/1051430/\">LWN</a></em> has an excellent article up on the “latency-criticality aware virtual deadline” (LAVD) scheduler, from a talk at the <em>Linux Plumbers Conference</em> in December.</p><p>In particular, I appreciate the detailed discussion of using different profilers and performance-analysis tools at different levels to determine how to optimize scheduling to improve two key goals: providing high average FPS while keeping 99th-percentile FPS as low as possible, e.g. to prevent UI stuttering. Optimizing for battery usage is also important, as the Steam Deck was one of the main targets for this work.</p><blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\"><p>The key finding that came out of his analysis is perhaps somewhat obvious: a single high-level action, such as moving a character on-screen and emitting a sound based on a key-press event, requires that many tasks work together. Some of the tasks are threads in the game process, but others are not because they are in the game engine, kernel, and device drivers; there are often 20 or 30 tasks in a chain that all need to collaborate. Finding tasks with a high waker or wakee frequency and prioritizing them is the basis of the LAVD scheduling policy.</p></blockquote><p>As always with <em>LWN</em> there’s good coverage not only of the talk itself, but also the Q&amp;A following the session and ideas from the audience on tooling and other improvements.</p><p><em><a href=\"https://www.phoronix.com/news/Meta-SCX-LAVD-Steam-Deck-Server\">Phoronix</a></em> also covered a different talk from the same conference (I think) on how Meta is using the LAVD scheduler as the basis for a new default scheduler used on their fleet. 
</p><p>I haven’t had a chance to watch this talk yet (<a href=\"https://youtu.be/KFItEHbFEwg?si=62Hsyr9ydHcOVu9b\">video</a> linked from the article) but I’m very interested in the idea that the same concepts might be useful to a hyper scaler as well as a device like a Steam Deck.</p>",
            "url": "https://hpc.social/personal-blog/2026/latency-critical-linux-task-scheduling-for-gaming/",
            
            
            
            
            
            "date_published": "2026-01-10T17:26:29-07:00",
            "date_modified": "2026-01-10T17:26:29-07:00",
            
                "author": "Thinking Out Loud"
            
        },
    
        {
            "id": "https://hpc.social/personal-blog/2026/orchestrating-hybrid-quantum-classical-workflows-with-ibm-lsf-inside-the-sqd-workflow-demo-at-sc25/",
            "title": "Orchestrating Hybrid Quantum–Classical Workflows with IBM LSF- Inside the SQD Workflow Demo at SC25",
            "summary": null,
            "content_text": "As we enter 2026, it seems that SC25 is far off in our rearview mirror. But it’s only been a bit over a month since the HPC world converged on St. Louis, Missouri for the annual Supercomputing 2025 (SC25) event. SC25 signaled one emerging trend: the exploration of hybrid workflows combining quantum and classical computing, offering a look at how these technologies can work synergistically over time. This was indeed the main topic of the 1st Annual Workshop on Large-Scale Quantum-Classical Computing, a workshop which I found to be very insightful.At the IBM booth, we showcased how IBM LSF can schedule and orchestrate a hybrid quantum–classical workflow across IBM Quantum systems and classical x86 compute.  The demo featured the Sample-based Quantum Diagonalization (SQD) workflow, to estimate the ground-state energy of a Hamiltonian representing a molecular system. SQD is part of the IBM Qiskit add-ons.Before diving into the details on what was demonstrated at SC25, and how LSF was used to manage the workflow, I would like to acknowledge that this work was supported by the Hartree Center for Digital Innovation, a collaboration between UKRI-STFC and IBM. The demonstration was created in close collaboration with Vadim Elisseev and Ritesh Krishna from IBM Research, alongside Gábor Samu and Michael Spriggs from IBM. Additionally, this post does not aim to provide an in-depth look at SQD itself. Rather, the focus is on how LSF can manage hybrid quantum-classical workflows across a heterogeneous environment comprised of both quantum and classical resources.Hybrid workflows are not newFor three decades, we have seen the use of accelerators in HPC to drive performance—from GPUs to FPGAs and other specialized architectures. Effective scheduling of tasks in these heterogeneous environments has always been a key consideration for efficiency, scalability—and to maximize the ROI in commercial HPC environments. 
As resource topologies grow more complex, scheduling must account for characteristics such as connectivity, latency, and dependency constraints across increasingly diverse infrastructures. Quantum Processors (QPUs) are now making their appearance as complementary resources within HPC workflows, aimed at challenges such as specific optimization problems, many-body physics and quantum chemistry.Demo detailsThe IBM LSF cluster was deployed on IBM Cloud using the LSF Deployable Architecture, which rapidly deploys and configures a ready-to-use HPC environment. IBM Research provided integration components for LSF in the form of esub and jobstarter scripts. These scripts enable LSF to query the cloud-based IBM Quantum Platform to determine which QPUs are available for a given user account and meet the qubit requirements specified at job submission. The list of eligible QPUs is then sorted by queue length, and the system with the shortest queue is selected as the target for the quantum circuit. These integration scripts (esub and jobstarter) are intended to be made open source at a later time.The LSF environment was deployed on IBM Cloud using the LSF Deployable Architecture v3.1.0:LSF 10.1.0.15RHEL 8.10IBM Cloud profile bx2-16x64 (compute hosts)The IBM Qiskit package versions used:qiskit v2.2.1qiskit-addon-sqd v0.12.0qiskit-ibm-runtime v0.43.0The SQD Python program is available as part of the IBM Qiskit Add-ons (see details here). For this demonstration, the original monolithic SQD script was refactored into four smaller Python programs—each representing a distinct step in the workflow. 
These steps map directly to LSF jobs, enabling orchestration of the workflow across the quantum and classical HPC resources as shown in the architecture diagram (Figure 1):Stage 1 maps the inputs to a quantum problem.Stage 2 optimizes the problem for quantum hardware execution—this is where the circuit is transpiled and optimized for the target QPU.Stage 3 executes the circuit on the QPU using Qiskit primitives.Stage 4 performs post-processing and returns the result in the desired classical format.Figure 1 LSF hybrid quantum-classical workflow demo (Vadim Elisseev, IBM Research)For this demonstration, we used IBM LSF Application Center—a web-based interface for job submission and management. LSF Application Center supports application templates, which simplify job submission by providing predefined forms. Templates were created for both the SQD workflow and the Jupyter Notebook application, which is used to visualize the workflow results.Demo execution stepsWe start by using the SQD template to submit an instance of the SQD workflow (Figure 2) which is used to calculate an approximate ground-state energy of the nitrogen molecule (N2). The submission form is customized to let users specify the script for each step of the workflow and specify the desired number of qubits required on the QPU for the quantum circuit. This parameter is used by LSF to select the appropriate quantum system from the available resources. Note that jobs are submitted to LSF with a done dependency condition, ensuring that each stage runs only after the previous one completes successfully. Stage 2 begins after Stage 1, Stage 3 follows Stage 2, and Stage 4 executes once Stage 3 has finished.Figure 2 LSF Application Center SQD submission formNext, we submit an instance of the Jupyter Notebook to monitor the workflow initiated in Step 1. This notebook is designed for this demonstration to visualize the status of each workflow step, displaying results as they successfully complete. 
Figure 3 shows the Jupyter submission form.Figure 3 LSF Application Center Jupyter Notebook submission formThe Workload view in the LSF Application Center can be used to monitor the progress of each job within the workflow. Additionally, the Jupyter Notebook instance can be accessed here via the provided hyperlink. Figure 4 shows the workload view in LSF Application Center. This shows a list of jobs in the LSF system.Figure 4 LSF Application Center workload viewAs each stage of the SQD workflow completes, the Jupyter Notebook displays the corresponding output in new browser tabs. This includes qubit coupling maps for the QPUs available on the IBM Quantum Platform for the specific account, a diagram of the circuit mapped to the selected QPU, readings from the QPU, and a plot of the estimated ground-state energy of the N2 molecule.Figure 5 Output from each step of the SQD workflow (Vadim Elisseev, IBM Research)Given that the demo environment was built using the LSF Deployable Architecture, IBM Cloud Monitoring is automatically configured. It provides a dashboard for the underlying cloud infrastructure, including detailed hardware metrics. In addition, an LSF Dashboard is available through IBM Cloud Monitoring, showing overall cluster metrics such as total jobs, job status, and queue distribution, along with scheduler performance trends over time. The IBM Cloud Monitoring infrastructure view and LSF dashboard are shown in Figure 6.Figure 6 IBM Cloud Monitoring: Infrastructure view, and LSF dashboardA video recording of the end-to-end demonstration can be found here.ConclusionsThis demo marked a milestone by demonstrating that IBM Spectrum LSF can seamlessly orchestrate quantum and classical compute resources for a unified workflow. 
This example demonstrates a practical approach to integrating quantum capabilities into an existing HPC environment running IBM LSF.This capability lays the foundation for hybrid computing pipelines that integrate emerging quantum hardware into established HPC environments. As organizations adopt these architectures and tools mature, we can expect production-grade workflows tackling complex problems across domains. The future of HPC is not a choice between classical or quantum—it is their convergence, working together to unlock new computational possibilities.The topic of scheduling for hybrid quantum-classical environments will be the subject of an upcoming paper “On Topological Aspects of Workflows Scheduling on Hybrid Quantum - High Performance Computing Systems” by Vadim Elisseev, Ritesh Krishna, Vasileios Kalantzis, M. Emre Sahin and Gábor Samu.",
            "content_html": "<p>As we enter 2026, it seems that SC25 is far off in our rearview mirror. But it&rsquo;s only been a bit over a month since the HPC world converged on St. Louis, Missouri for the annual <a href=\"https://sc25.supercomputing.org/\">Supercomputing 2025</a> (SC25) event. SC25 signaled one emerging trend: the exploration of hybrid workflows combining quantum and classical computing, offering a look at how these technologies can work synergistically over time. This was indeed the main topic of the 1st Annual Workshop on Large-Scale Quantum-Classical Computing, a workshop which I found to be very insightful.</p><p>At the IBM booth, we showcased how <a href=\"https://www.ibm.com/products/hpc-workload-management\">IBM LSF</a> can schedule and orchestrate a hybrid quantum–classical workflow across IBM Quantum systems and classical x86 compute.  The demo featured the Sample-based Quantum Diagonalization (SQD) workflow, to estimate the ground-state energy of a Hamiltonian representing a molecular system. SQD is part of the <a href=\"https://quantum.cloud.ibm.com/docs/en/guides/qiskit-addons-sqd\">IBM Qiskit add-ons</a>.</p><p>Before diving into the details on what was demonstrated at SC25, and how LSF was used to manage the workflow, I would like to acknowledge that this work was supported by the Hartree Center for Digital Innovation, a collaboration between UKRI-STFC and IBM. The demonstration was created in close collaboration with Vadim Elisseev and Ritesh Krishna from IBM Research, alongside Gábor Samu and Michael Spriggs from IBM. Additionally, this post does not aim to provide an in-depth look at SQD itself. 
Rather, the focus is on how LSF can manage hybrid quantum-classical workflows across a heterogeneous environment composed of both quantum and classical resources.</p><p><strong>Hybrid workflows are not new</strong></p><p>For three decades, we have seen the use of accelerators in HPC to drive performance—from GPUs to FPGAs and other specialized architectures. Effective scheduling of tasks in these heterogeneous environments has always been a key consideration for efficiency, scalability—and to maximize the ROI in commercial HPC environments. As resource topologies grow more complex, scheduling must account for characteristics such as connectivity, latency, and dependency constraints across increasingly diverse infrastructures. Quantum Processors (QPUs) are now making their appearance as complementary resources within HPC workflows, aimed at challenges such as specific optimization problems, many-body physics and quantum chemistry.</p><p><strong>Demo details</strong></p><p>The IBM LSF cluster was deployed on IBM Cloud using the LSF Deployable Architecture, which rapidly deploys and configures a ready-to-use HPC environment. IBM Research provided integration components for LSF in the form of esub and jobstarter scripts. These scripts enable LSF to query the cloud-based IBM Quantum Platform to determine which QPUs are available for a given user account and meet the qubit requirements specified at job submission. The list of eligible QPUs is then sorted by queue length, and the system with the shortest queue is selected as the target for the quantum circuit. 
These integration scripts (esub and jobstarter) are intended to be made open source at a later time.</p><p>The LSF environment was deployed on IBM Cloud using the <a href=\"https://cloud.ibm.com/catalog/architecture/deploy-arch-ibm-hpc-lsf-1444e20a-af22-40d1-af98-c880918849cb-global\">LSF Deployable Architecture</a> v3.1.0:</p><ul><li>LSF 10.1.0.15</li><li>RHEL 8.10</li><li>IBM Cloud profile bx2-16x64 (compute hosts)</li></ul><p>The IBM Qiskit package versions used:</p><ul><li>qiskit v2.2.1</li><li>qiskit-addon-sqd v0.12.0</li><li>qiskit-ibm-runtime v0.43.0</li></ul><p>The SQD Python program is available as part of the IBM Qiskit Add-ons (see details here). For this demonstration, the original monolithic SQD script was refactored into four smaller Python programs—each representing a distinct step in the workflow. These steps map directly to LSF jobs, enabling orchestration of the workflow across the quantum and classical HPC resources as shown in the architecture diagram (Figure 1):</p><ul><li><strong>Stage 1</strong> maps the inputs to a quantum problem</li><li><strong>Stage 2</strong> optimizes the problem for quantum hardware execution—this is where the circuit is transpiled and optimized for the target QPU</li><li><strong>Stage 3</strong> executes the circuit on the QPU using Qiskit primitives</li><li><strong>Stage 4</strong> performs post-processing and returns the result in the desired classical format</li></ul><p><figure><img src=\"https://www.gaborsamu.com/images/figure1_lsfqc.png\" /></figure><em>Figure 1 LSF hybrid quantum-classical workflow demo (Vadim Elisseev, IBM Research)</em></p><p>For this demonstration, we used IBM LSF Application Center—a web-based interface for job submission and management. LSF Application Center supports application templates, which simplify job submission by providing predefined forms. 
Templates were created for both the SQD workflow and the Jupyter Notebook application, which is used to visualize the workflow results.</p><p><strong>Demo execution steps</strong></p><ul><li>We start by using the SQD template to submit an instance of the SQD workflow (Figure 2), which is used to calculate an approximate ground-state energy of the nitrogen molecule (N2). The submission form is customized to let users specify the script for each step of the workflow and the number of qubits required on the QPU for the quantum circuit. This parameter is used by LSF to select the appropriate quantum system from the available resources. Note that jobs are submitted to LSF with a done dependency condition, ensuring that each stage runs only after the previous one completes successfully. Stage 2 begins after Stage 1, Stage 3 follows Stage 2, and Stage 4 executes once Stage 3 has finished.</li></ul><p><figure><img src=\"https://www.gaborsamu.com/images/figure2a_lsfqc.png\" /></figure><em>Figure 2 LSF Application Center SQD submission form</em></p><ul><li>Next, we submit an instance of the Jupyter Notebook to monitor the workflow initiated in Step 1. This notebook is designed for this demonstration to visualize the status of each workflow step, displaying results as they successfully complete. Figure 3 shows the Jupyter submission form.</li></ul><p><figure><img src=\"https://www.gaborsamu.com/images/figure3a_lsfqc.png\" /></figure><em>Figure 3 LSF Application Center Jupyter Notebook submission form</em></p><ul><li>The Workload view in the LSF Application Center can be used to monitor the progress of each job within the workflow. Additionally, the Jupyter Notebook instance can be accessed here via the provided hyperlink. Figure 4 shows the workload view in LSF Application Center. 
This shows a list of jobs in the LSF system.</li></ul><p><figure><img src=\"https://www.gaborsamu.com/images/figure4a_lsfqc.png\" /></figure><em>Figure 4 LSF Application Center workload view</em></p><ul><li>As each stage of the SQD workflow completes, the Jupyter Notebook displays the corresponding output in new browser tabs. This includes qubit coupling maps for the QPUs available on the IBM Quantum Platform for the specific account, a diagram of the circuit mapped to the selected QPU, readings from the QPU, and a plot of the estimated ground-state energy of the N2 molecule.</li></ul><p><figure><img src=\"https://www.gaborsamu.com/images/figure5_lsfqc.png\" /></figure><em>Figure 5 Output from each step of the SQD workflow (Vadim Elisseev, IBM Research)</em></p><ul><li>Given that the demo environment was built using the LSF Deployable Architecture, IBM Cloud Monitoring is automatically configured. It provides a dashboard for the underlying cloud infrastructure, including detailed hardware metrics. In addition, an LSF Dashboard is available through IBM Cloud Monitoring, showing overall cluster metrics such as total jobs, job status, and queue distribution, along with scheduler performance trends over time. The IBM Cloud Monitoring infrastructure view and LSF dashboard are shown in Figure 6.</li></ul><p><figure><img src=\"https://www.gaborsamu.com/images/figure6_lsfqc.png\" /></figure><em>Figure 6 IBM Cloud Monitoring: Infrastructure view, and LSF dashboard</em></p><p>A video recording of the end-to-end demonstration can be found <a href=\"https://community.ibm.com/community/user/viewdocument/demonstration-of-managing-hybrid-qu?CommunityKey=74d589b7-7276-4d70-acf5-0fc26430c6c0&amp;tab=librarydocuments\">here</a>.</p><p><strong>Conclusions</strong></p><p>This demo marked a milestone by demonstrating that IBM Spectrum LSF can seamlessly orchestrate quantum and classical compute resources for a unified workflow. 
This example demonstrates a practical approach to integrating quantum capabilities into an existing HPC environment running IBM LSF.</p><p>This capability lays the foundation for hybrid computing pipelines that integrate emerging quantum hardware into established HPC environments. As organizations adopt these architectures and tools mature, we can expect production-grade workflows tackling complex problems across domains. The future of HPC is not a choice between classical or quantum—it is their convergence, working together to unlock new computational possibilities.</p><p>The topic of scheduling for hybrid quantum-classical environments will be the subject of an upcoming paper &ldquo;On Topological Aspects of Workflows Scheduling on Hybrid Quantum - High Performance Computing Systems&rdquo; by Vadim Elisseev, Ritesh Krishna, Vasileios Kalantzis, M. Emre Sahin and Gábor Samu.</p>",
            "url": "https://hpc.social/personal-blog/2026/orchestrating-hybrid-quantum-classical-workflows-with-ibm-lsf-inside-the-sqd-workflow-demo-at-sc25/",
            
            
            
            
            
            "date_published": "2026-01-08T14:22:59-07:00",
            "date_modified": "2026-01-08T14:22:59-07:00",
            
                "author": "Ramblings of a supercomputing enthusiast."
            
        },
    
        {
            "id": "https://hpc.social/personal-blog/2026/my-cousin-vinny-as-an-llm-benchmark/",
            "title": "“My Cousin Vinny” as an LLM benchmark",
            "summary": null,
            "content_text": "Mike Caulfield wrote a very thorough and quite entertaining article about posing the following question to ChatGPT:What were Marisa Tomei’s most famous quotes from My Cousin Vinny and what was the context?Depending on the model selected, the answers to this varied from hilariously wrong, to plausible-but-flawed, to accurate. Interestingly, substantial test-time compute (“thinking time”) seems to be necessary to do a good job here, despite the easy availability online of famous quotes, plot summaries, and even the script. While the fast-response models available for free were prone to hallucinate. At the same time I was struck just how&nbsp;much reasoning time needed to be expended to get this task right. It’s possible that&nbsp;My Cousin Vinny&nbsp;is uniquely hard to parse, but I don’t think that is the case. I’ve tried this with a half dozen other films and the pattern seems to hold. If it’s true that a significant amount of similar film contextualization tasks are solvable with test-time compute but require extensive compute to get it right, it seems to me this could be the basis of a number of useful benchmarks.The full article is well-worth reading, and not only because it discusses My Cousin Vinny in substantial detail (great movie).",
            "content_html": "<p>Mike Caulfield wrote a <a href=\"https://mikecaulfield.substack.com/p/notes-towards-a-narrative-llm-benchmark\">very thorough and quite entertaining article</a> about posing the following question to ChatGPT:</p><blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\"><p>What were Marisa Tomei’s most famous quotes from My Cousin Vinny and what was the context?</p></blockquote><p>Depending on the model selected, the answers to this varied from hilariously wrong, to plausible-but-flawed, to accurate. </p><p>Interestingly, substantial test-time compute (“thinking time”) seems to be necessary to do a good job here, despite the easy availability online of famous quotes, plot summaries, and even the script. While the fast-response models available for free were prone to hallucinate. </p><blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\"><p>At the same time I was struck just how&nbsp;<em>much</em> reasoning time needed to be expended to get this task right. It’s possible that&nbsp;<em>My Cousin Vinny</em>&nbsp;is uniquely hard to parse, but I don’t think that is the case. I’ve tried this with a half dozen other films and the pattern seems to hold. If it’s true that a significant amount of similar film contextualization tasks are solvable with test-time compute but require extensive compute to get it right, it seems to me this could be the basis of a number of useful benchmarks.</p></blockquote><p>The <a href=\"https://mikecaulfield.substack.com/p/notes-towards-a-narrative-llm-benchmark\">full article</a> is well-worth reading, and not only because it discusses <em>My Cousin Vinny</em> in substantial detail (great movie).</p>",
            "url": "https://hpc.social/personal-blog/2026/my-cousin-vinny-as-an-llm-benchmark/",
            
            
            
            
            
            "date_published": "2026-01-05T04:04:31-07:00",
            "date_modified": "2026-01-05T04:04:31-07:00",
            
                "author": "Thinking Out Loud"
            
        },
    
        {
            "id": "https://hpc.social/personal-blog/2026/podcasts-and-blogs-i-m-following-in-early-2026/",
            "title": "Podcasts and blogs I’m following in early 2026",
            "summary": null,
            "content_text": "As part of the new year, I&#8217;m going through my feed readers for podcasts and blogs. This is mostly a cleanup exercise to remove sources that I regularly skip, but I&#8217;m also adding in a few feeds for sites that I find myself regularly clicking on in social media. As part of this, I figured I&#8217;d share the sources that made the cut to stick around.You&#8217;ll notice that there are a lot of podcasts in this post! With two golden doodles in the family, I spend a lot of time on dog walks, not to mention doing chores around the house. Because of this, it&#8217;s often a lot easier for me to listen to content than read it, and indeed I often find myself feeding long-form text articles into ElevenReader so that I can listen to those items too.Nerd notes:I self-host FreshRSS to aggregate written blog feeds and read them using Reeder. FreshRSS is hosted on a private VPS that I access via Tailscale on my various devices, because I&#8217;m really the only person that needs to access it.I listen to podcasts via Overcast, which I prefer for its audio features to the default Apple Podcasts app.This is not 100% complete as there are some blogs I follow purely through the Patreon site, and I haven’t (yet) taken the time to go through that and add them to this list.There are a few NSFW items left out intentionally, as my voice on this blog is at least semi-professional  Computing-relatedPodcasts:Oxide and Friends is a weekly live show recorded by Bryan Cantrill, Adam Leventhal, and friends from the Oxide Computer Company. Despite being a &#8220;corporate&#8221; podcast, it generally has the vibe of &#8220;Car Talk for Computers&#8221; and can often dig into really interesting computer systems topics, and even computer and industry history. Some of the episodes get into specifics of the Oxide product, which may be interesting or skip-able depending on your interests.Fork Around and Find Out is a resurrection of the Ship It! 
podcast that used to be part of the Changelog network, focused on production systems, on-call, and large-scale engineering. The updates are a bit irregular but Justin and Autumn are enjoyable hosts to listen to.This is Fine! is a podcast on resilience engineering from Colette Alexander and Clint Byrum. It sometimes picks up topics from discussions in the Resilience in Software Foundation Slack instance and is almost always a fun listen.The Important Thing is an occasional discussion podcast between Michael Lopp and Lyle Troxell. Often very random, with widely varying episode lengths, it’s nicely chatty and a great listen during dog walks.Signals and Threads, from Jane Street, is another occasional podcast but often gets very deep into interesting software topics such as performance analysis, state machine replication, and memory management — with a focus on low-latency trading systems that have interesting constraints. The Changelog is a long-standing general software news podcast. I dip in and out of this one based on the topic being covered, and often like their “Changelog and Friends” chatty episodes more than the news or interview episodes.The Compute Architecture Podcast updates very infrequently but often has interesting, hardware- or systems-focused interviews.Personal blogs:Charity Majors is a long-time follow of mine for her technical work on observability, her practical SRE/sysadmin mentality, and her useful perspectives on engineering management.Brendan Gregg posts infrequently, but will publish fascinating deep dives on performance engineering that are always worth reading.Chris Siebenmann is a sysadmin at the University of Toronto and a prolific blogger about nuts-and-bolts Linux admin topics. 
He publishes a ton so I dip in and out of his feed, but always keep it in my feed reader.Cat Hicks does psychological research on software teams and I always learn a ton from her writing.Soatok writes excellent, interesting, and opinionated articles on security and cryptography topics.Fred Hebert is an SRE with a strong interest in resilience engineering who frequently discusses interesting academic research on resilience and human factors.Glenn Lockwood is an HPC engineer who I’ve known online for a long time, and who came into the field from materials science in a similar manner to me. He’s worked at SDSC, NERSC, Microsoft, and VAST, and his annual recap of the Supercomputing conference is worth reading every year. (So is the rest of his blog!)Sean Goedecke is a software engineer at GitHub who writes interesting work on AI and on the dynamics of large companies.Simon Willison is one of the most essential bloggers on AI and LLMs today, not to mention incredibly prolific. His style of writing short posts on whatever he’s thinking about is one I hope to emulate more often here!Rachel Kroll, aka Rachel By the Bay, is a long-time sysadmin/SRE who writes on detailed sysadmin and software engineering topics in an often-ironic fashion.Xe Iaso is a software engineer and author of the Anubis Web AI Firewall tool. Xer blog covers a wide variety of software, systems, and AI work.Technical blogs and industry news:LWN is the definitive source for Linux and free software news, and is supported by the community via subscriptions. 
You should subscribe!Chips and Cheese does really interesting deep dives into chip architecture and performance, often focused on newer products but occasionally digging into older hardware.SemiAnalysis is at this point one of the most essential news sources for the semiconductor industry, and one of the few paid sources I follow.Jepsen publishes Kyle Kingsbury’s detailed analyses of distributed systems reliability and consistency, performed both as consulting engagements and for the community. Read all of these, they’re excellent!Semiconductor Engineering is one of the long-standing industry news sites. I don’t read a ton of this but I do keep an eye on the feed for interesting headlines.Similarly, Data Center Dynamics is one of the standard industry news sites for data centers.News and PoliticsPodcasts:The Lawfare Podcast covers a really wide variety of national security law topics. My only regular listen is their Rational Security episodes which provide a weekly roundup of relevant news in an informal discussion format, but I dip in and out of the others.Money Stuff is a fun weekly podcast from Matt Levine and Katie Greifeld of Bloomberg News, who discuss weekly financial news from a very nerdy perspective. I’m not generally a huge finance person, but I like that this podcast allows me to listen in to people geeking out about the topic.The World in Brief from the Economist is their daily quick summary of the news. I have very mixed feelings about the Economist in general — as with many British sources, they platform far too much transphobia — but I have yet to find a better substitute for “quick morning summary of the news”. 
At least, nothing else that doesn’t make me want to throw my phone at a wall.Blogs and News:Rest of World covers tech industry news with a focus on impacts outside the West, and often has really interesting coverage from a different angle.Liberal Currents is a political blog focused on liberalism, both in current events and as a political philosophy, and has published a lot of excellent pieces since I started reading it in 2025.MiscellaneousPodcasts:Arms Control Wonk continues to be a good listen, though it’s updated somewhat sporadically the past few years. The coverage of nuclear weapons, missiles and other delivery systems, and current events around arms control (or lack thereof) is very good. If you sponsor them via Patreon, their Slack instance is also a fascinating discussion forum, though I only dip in and out of it occasionally.The Culture Study Podcast by Anne Helen Petersen features conversations between Anne and a guest and focuses on listener Q&A. It often covers culture topics that I otherwise don’t get much of through other feeds. For example, recent episodes have talked about anything from birding to K-pop to the anatomy of cultural panics. I don’t listen to every episode, but they’re often quite fun.Neon Liberalism is a regular podcast from Liberal Currents. 
I might put this in the News category except that it often digs into political topics from a historical or theoretical perspective rather than just focusing on current events.Similarly, Reimagining Liberty from Aaron Ross Powell digs into political theory and current events from the perspective of Powell’s particular strain of libertarianism, which is much more in conversation with modern liberalism and anarchism vs more right-wing strains.Personal blogs:Phil Broughton is a health physicist at UC Berkeley who has worked in classified nuclear work at LLNL as well as spending a year in Alaska, and has a wealth of fascinating and hilarious stories.Lois McMaster Bujold is one of my favorite science fiction and fantasy authors. While she’s semi-retired, she still writes occasional novellas following Penric, a sorcerer in her World of the Five Gods, which I really love. Her blog helpfully announces new stories!John Scalzi is another favorite author, and also has an excellent blog called Whatever.Bret Devereaux writes A Collection of Unmitigated Pedantry about history, the military, and pop culture. If you’re interested in an extensive deep dive into the military missteps made by Saruman in The Two Towers, this is the blog for you!Bits About Money is Patrick McKenzie’s blog about finance, and each entry tends to be a highly nerdy deep dive about how some interesting corner of the financial system works.Webcomics:The Order of the Stick is a long-running D&D stick-figure comic that I have been reading for longer than I can really say. I highly recommend it, though I’ll warn you that with an archive of >1,300 comics (and growing!) you are likely to lose a lot of time this way.Questionable Content is a slice-of-life comic about a coffee shop… with robots, super-intelligent AIs, stupid dick jokes, and more. 
Also has a long archive to dig through.Girl Genius is a long-running online comic book about “Adventure, Romance, and MAD SCIENCE!” and thoroughly excellent.",
            "content_html": "<p>As part of the new year, I&#8217;m going through my feed readers for podcasts and blogs. This is mostly a cleanup exercise to remove sources that I regularly skip, but I&#8217;m also adding in a few feeds for sites that I find myself regularly clicking on in social media. As part of this, I figured I&#8217;d share the sources that made the cut to stick around.</p><p><span id=\"more-377\"></span></p><p>You&#8217;ll notice that there are a lot of podcasts in this post! With two golden doodles in the family, I spend a lot of time on dog walks, not to mention doing chores around the house. Because of this, it&#8217;s often a lot easier for me to listen to content than read it, and indeed I often find myself feeding long-form text articles into <a href=\"https://elevenreader.io/\">ElevenReader</a> so that I can listen to those items too.</p><p><strong>Nerd notes:</strong></p><ul class=\"wp-block-list\"><li>I self-host <a href=\"https://freshrss.org/index.html\">FreshRSS</a> to aggregate written blog feeds and read them using <a href=\"https://reederapp.com/\">Reeder</a>. 
FreshRSS is hosted on a private VPS that I access via <a href=\"https://tailscale.com/\">Tailscale</a> on my various devices, because I&#8217;m really the only person that needs to access it.</li><li>I listen to podcasts via <a href=\"https://overcast.fm/\">Overcast</a>, which I prefer for its audio features to the default Apple Podcasts app.</li><li>This is not 100% complete as there are some blogs I follow purely through the Patreon site, and I haven’t (yet) taken the time to go through that and add them to this list.</li><li>There are a few NSFW items left out intentionally, as my voice on this blog is at least semi-professional <img alt=\"😉\" class=\"wp-smiley\" src=\"https://s.w.org/images/core/emoji/17.0.2/72x72/1f609.png\" style=\"height: 1em;\" /> </li></ul><h2 class=\"wp-block-heading\">Computing-related</h2><p><strong>Podcasts:</strong></p><ul class=\"wp-block-list\"><li><em><a href=\"https://oxide-and-friends.transistor.fm/\">Oxide and Friends</a></em> is a weekly live show recorded by Bryan Cantrill, Adam Leventhal, and friends from the <a href=\"https://oxide.computer\">Oxide Computer Company</a>. Despite being a &#8220;corporate&#8221; podcast, it generally has the vibe of &#8220;Car Talk for Computers&#8221; and can often dig into really interesting computer systems topics, and even computer and industry history. Some of the episodes get into specifics of the Oxide product, which may be interesting or skip-able depending on your interests.</li><li><em><a href=\"https://www.fafo.fm/\">Fork Around and Find Out</a></em> is a resurrection of the <em>Ship It!</em> podcast that used to be part of the Changelog network, focused on production systems, on-call, and large-scale engineering. The updates are a bit irregular but Justin and Autumn are enjoyable hosts to listen to.</li><li><em><a href=\"https://www.thisisfinepod.com/\">This is Fine!</a></em> is a podcast on resilience engineering from Colette Alexander and Clint Byrum. 
It sometimes picks up topics from discussions in the <a href=\"https://resilienceinsoftware.org/\">Resilience in Software Foundation Slack</a> instance and is almost always a fun listen.</li><li><em><a href=\"https://randsinrepose.com/the-important-thing/\">The Important Thing</a> </em>is an occasional discussion podcast between <a href=\"https://randsinrepose.com/\">Michael Lopp</a> and <a href=\"https://troxell.com/\">Lyle Troxell</a>. Often very random, with widely varying episode lengths, it&#8217;s nicely chatty and a great listen during dog walks.</li><li><em><a href=\"https://signalsandthreads.com/\">Signals and Threads</a></em>, from Jane Street, is another occasional podcast but often gets very deep into interesting software topics such as performance analysis, state machine replication, and memory management &#8212; with a focus on low-latency trading systems that have interesting constraints. </li><li><a href=\"https://changelog.com/podcast\"><em>The Changelog</em></a> is a long-standing general software news podcast. 
I dip in and out of this one based on the topic being covered, and often like their &#8220;Changelog and Friends&#8221; chatty episodes more than the news or interview episodes.</li><li><em><a href=\"https://comparchpodcast.podbean.com/\">The Compute Architecture Podcast</a></em> updates very infrequently but often has interesting, hardware- or systems-focused interviews.</li></ul><p><strong>Personal blogs:</strong></p><ul class=\"wp-block-list\"><li><a href=\"https://charity.wtf/\">Charity Majors</a> is a long-time follow of mine for her technical work on observability, her practical SRE/sysadmin mentality, and her useful perspectives on engineering management.</li><li><a href=\"https://www.brendangregg.com/blog/\">Brendan Gregg</a> posts infrequently, but will publish fascinating deep dives on performance engineering that are always worth reading.</li><li><a href=\"https://utcc.utoronto.ca/~cks/space/blog/\">Chris Siebenmann</a> is a sysadmin at the University of Toronto and a prolific blogger about nuts-and-bolts Linux admin topics. He publishes a <em>ton</em> so I dip in and out of his feed, but always keep it in my feed reader.</li><li><a href=\"https://www.drcathicks.com/blog\">Cat Hicks</a> does psychological research on software teams and I always learn a ton from her writing.</li><li><a href=\"https://soatok.blog/b/\">Soatok</a> writes excellent, interesting, and opinionated articles on security and cryptography topics.</li><li><a href=\"https://ferd.ca/\">Fred Hebert</a> is an SRE with a strong interest in resilience engineering who frequently discusses interesting academic research on resilience and human factors.</li><li><a href=\"https://blog.glennklockwood.com/\">Glenn Lockwood</a> is an HPC engineer who I&#8217;ve known online for a long time, and who came into the field from materials science in a similar manner to me. 
He&#8217;s worked at SDSC, NERSC, Microsoft, and VAST, and his annual recap of the Supercomputing conference is worth reading every year. (So is the rest of his blog!)</li><li><a href=\"https://www.seangoedecke.com/\">Sean Goedecke</a> is a software engineer at GitHub who writes interesting work on AI and on the dynamics of large companies.</li><li><a href=\"https://simonwillison.net/\">Simon Willison</a> is one of the most essential bloggers on AI and LLMs today, not to mention incredibly prolific. His style of writing short posts on whatever he&#8217;s thinking about is one I hope to emulate more often here!</li><li>Rachel Kroll, aka <a href=\"https://rachelbythebay.com/w/\">Rachel By the Bay</a>, is a long-time sysadmin/SRE who writes on detailed sysadmin and software engineering topics in an often-ironic fashion.</li><li><a href=\"https://xeiaso.net/blog/\">Xe Iaso</a> is a software engineer and author of the <a href=\"https://anubis.techaro.lol/\">Anubis</a> Web AI Firewall tool. Xer blog covers a wide variety of software, systems, and AI work.<br /></li></ul><p><strong>Technical blogs and industry news:</strong></p><ul class=\"wp-block-list\"><li><em><a href=\"https://lwn.net/\">LWN</a></em> is the definitive source for Linux and free software news, and is supported by the community via subscriptions. 
You should subscribe!</li><li><em><a href=\"https://chipsandcheese.com/\">Chips and Cheese</a></em> does really interesting deep dives into chip architecture and performance, often focused on newer products but occasionally digging into older hardware.</li><li><a href=\"https://semianalysis.com/\"><em>SemiAnalysis</em></a> is at this point one of the most essential news sources for the semiconductor industry, and one of the few paid sources I follow.</li><li><em><a href=\"https://jepsen.io/blog\">Jepsen</a></em> publishes <a href=\"https://aphyr.com/\">Kyle Kingsbury</a>&#8217;s detailed analyses of distributed systems reliability and consistency, performed both as consulting engagements and for the community. Read all of these, they&#8217;re excellent!</li><li><em><a href=\"https://semiengineering.com/\">Semiconductor Engineering</a></em> is one of the long-standing industry news sites. I don&#8217;t read a ton of this but I do keep an eye on the feed for interesting headlines.</li><li>Similarly, <em><a href=\"https://www.datacenterdynamics.com/en/\">Data Center Dynamics</a></em> is one of the standard industry news sites for data centers.</li></ul><h2 class=\"wp-block-heading\">News and Politics</h2><p><strong>Podcasts:</strong></p><ul class=\"wp-block-list\"><li><em><a href=\"https://www.lawfaremedia.org/podcasts-multimedia/podcast\">The Lawfare Podcast</a></em> covers a really wide variety of national security law topics. My only regular listen is their <em><a href=\"https://www.lawfaremedia.org/podcasts-multimedia/podcast/rational-security\">Rational Security</a></em> episodes which provide a weekly roundup of relevant news in an informal discussion format, but I dip in and out of the others.</li><li><em><a href=\"https://www.bloomberg.com/podcasts/series/money-stuff\">Money Stuff</a></em> is a fun weekly podcast from Matt Levine and Katie Greifeld of Bloomberg News, who discuss weekly financial news from a very nerdy perspective. 
I&#8217;m not generally a huge finance person, but I like that this podcast allows me to listen in to people geeking out about the topic.</li><li><em><a href=\"https://www.economist.com/the-world-in-brief\">The World in Brief</a></em> from the Economist is their daily quick summary of the news. I have very mixed feelings about the Economist in general &#8212; as with many British sources, they platform far too much transphobia &#8212; but I have yet to find a better substitute for &#8220;quick morning summary of the news&#8221;. At least, nothing else that doesn&#8217;t make me want to throw my phone at a wall.</li></ul><p><strong>Blogs and News:</strong></p><ul class=\"wp-block-list\"><li><em><a href=\"https://restofworld.org/\">Rest of World</a></em> covers tech industry news with a focus on impacts outside the West, and often has really interesting coverage from a different angle.</li><li><em><a href=\"https://www.liberalcurrents.com/\">Liberal Currents</a></em> is a political blog focused on liberalism, both in current events and as a political philosophy, and has published a lot of excellent pieces since I started reading it in 2025.</li></ul><h2 class=\"wp-block-heading\">Miscellaneous</h2><p><strong>Podcasts:</strong></p><ul class=\"wp-block-list\"><li><a href=\"https://www.armscontrolwonk.com/archive/author/podcast/\"><em>Arms Control Wonk</em></a> continues to be a good listen, though it&#8217;s been updated somewhat sporadically the past few years. The coverage of nuclear weapons, missiles and other delivery systems, and current events around arms control (or lack thereof) is very good. If you sponsor them via Patreon, their Slack instance is also a fascinating discussion forum, though I only dip in and out of it occasionally.</li><li><em><a href=\"https://culturestudypod.substack.com/\">The Culture Study Podcast</a></em> by Anne Helen Petersen features conversations between Anne and a guest and focuses on listener Q&amp;A. 
It often covers culture topics that I otherwise don&#8217;t get much of through other feeds. For example, recent episodes have talked about anything from birding to K-pop to the anatomy of cultural panics. I don&#8217;t listen to every episode, but they&#8217;re often quite fun.</li><li><em><a href=\"https://www.liberalcurrents.com/neonliberalism/\">Neon Liberalism</a></em> is a regular podcast from <a href=\"https://www.liberalcurrents.com/\">Liberal Currents</a>. I might put this in the News category except that it often digs into political topics from a historical or theoretical perspective rather than just focusing on current events.</li><li>Similarly, <em><a href=\"https://open.spotify.com/show/29wW6zsYyYuelcFJcyHOmv\">Reimagining Liberty</a></em> from Aaron Ross Powell digs into political theory and current events from the perspective of Powell&#8217;s particular strain of libertarian-ism, which is much more in conversation with modern liberalism and anarchism vs more right-wing strains.</li></ul><p><strong>Personal blogs:</strong></p><ul class=\"wp-block-list\"><li><a href=\"https://www.funraniumlabs.com/\">Phil Broughton</a> is a health physicist at UC Berkeley who has worked in classified nuclear work at LLNL as well as spending a year in Alaska, and has a wealth of fascinating and hilarious stories.</li><li><a href=\"https://www.goodreads.com/author/show/16094.Lois_McMaster_Bujold/blog\">Lois McMaster Bujold</a> is one of my favorite science fiction and fantasy authors. While she&#8217;s semi-retired, she still writes occasional novellas following Penric, a sorcerer in her World of the Five Gods, which I really love. 
Her blog helpfully announces new stories!</li><li><a href=\"https://whatever.scalzi.com/\">John Scalzi</a> is another favorite author, and also has an excellent blog called <em>Whatever</em>.</li><li>Bret Devereaux writes <em><a href=\"https://acoup.blog/\">A Collection of Unmitigated Pedantry</a></em> about history, the military, and pop culture. If you&#8217;re interested in an extensive deep dive into the military missteps made by Saruman in <em>The Two Towers</em>, this is the blog for you!</li><li><em><a href=\"https://www.bitsaboutmoney.com/\">Bits About Money</a></em> is Patrick McKenzie&#8217;s blog about finance, and each entry tends to be a highly nerdy deep dive about how some interesting corner of the financial system works.</li></ul><p><strong>Webcomics:</strong></p><ul class=\"wp-block-list\"><li><em><a href=\"https://www.giantitp.com/comics/oots.html\">The Order of the Stick</a></em> is a long-running D&amp;D stick-figure comic that I have been reading for longer than I can really say. I highly recommend it, though I&#8217;ll warn you that with an archive of &gt;1,300 comics (and growing!) you are likely to lose a lot of time this way.</li><li><em><a href=\"https://questionablecontent.net/\">Questionable Content</a></em> is a slice-of-life comic about a coffee shop&#8230; with robots, super-intelligent AIs, stupid dick jokes, and more. Also has a long archive to dig through.</li><li><em><a href=\"https://www.girlgeniusonline.com/\">Girl Genius</a></em> is a long-running online comic book about &#8220;Adventure, Romance, and MAD SCIENCE!&#8221; and is thoroughly excellent.</li></ul>",
            "url": "https://hpc.social/personal-blog/2026/podcasts-and-blogs-i-m-following-in-early-2026/",
            
            
            
            
            
            "date_published": "2026-01-02T19:13:35-07:00",
            "date_modified": "2026-01-02T19:13:35-07:00",
            
                "author": "Thinking Out Loud"
            
        },
    
        {
            "id": "https://hpc.social/personal-blog/2025/on-friday-deploys/",
            "title": "On Friday deploys",
            "summary": null,
            "content_text": "This post from Charity Majors on Friday deploys is well worth reading. In the past I’ve seen her comment on how deployments should be carried out fearlessly regardless of when, and I’ve often felt like saying “yeah, well, …”. Because of course I agree with that as a goal, but many real-world orgs and conditions make it challenging. This most recent post talks about the situations when those freezes can make sense, even if they’re not ideal. And in particular I like the discussion of how what really needs to be frozen is not deploys, but merges: To a developer, ideally, the act of merging their changes back to main and those changes being deployed to production should feel like one singular atomic action, the faster the better, the less variance the better. You merge, it goes right out. You don’t want it to go out, you better not merge. The worst of both worlds is when you let devs keep merging diffs, checking items off their todo lists, closing out tasks, for days or weeks. All these changes build up like a snowdrift over a pile of grenades. You aren’t going to find the grenades til you plow into the snowdrift on January 5th, and then you’ll find them with your face. Congrats!",
            "content_html": "<p><a href=\"https://charity.wtf/2025/12/24/on-friday-deploys-sometimes-that-puppy-needs-murdering-xpost/\">This post</a> from Charity Majors on Friday deploys is well worth reading. </p><p>In the past I’ve seen her comment on how deployments should be carried out fearlessly regardless of when, and I’ve often felt like saying “yeah, well, …”. Because of course I agree with that as a goal, but many real-world orgs and conditions make it challenging.</p><p>This most recent post talks about the situations when those freezes <em>can</em> make sense, even if they’re not ideal. And in particular I like the discussion of how what really needs to be frozen is not deploys, but merges:</p><blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\"><p>To a developer, ideally, the act of merging their changes back to main and those changes being deployed to production should feel like one singular atomic action, the faster the better, the less variance the better. You merge, it goes right out. You don’t want it to go out, you better not merge.</p><p>The worst of both worlds is when you let devs keep merging diffs, checking items off their todo lists, closing out tasks, for days or weeks. All these changes build up like a snowdrift over a pile of grenades. You aren’t going to find the grenades til you plow into the snowdrift on January 5th, and then you’ll find them with your face. Congrats!</p></blockquote>",
            "url": "https://hpc.social/personal-blog/2025/on-friday-deploys/",
            
            
            
            
            
            "date_published": "2025-12-30T20:58:00-07:00",
            "date_modified": "2025-12-30T20:58:00-07:00",
            
                "author": "Thinking Out Loud"
            
        },
    
        {
            "id": "https://hpc.social/personal-blog/2025/why-generic-software-design-advice-is-often-useless/",
            "title": "Why generic software design advice is often useless",
            "summary": null,
            "content_text": "In You can’t design software you don’t work on, Sean Goedecke discusses why generic advice on the design of software systems is often unhelpful. When you’re doing real work, concrete factors dominate generic factors. Having a clear understanding of what the code looks like right now is far, far more important than having a good grasp on general design patterns or principles. This tracks with my experience not just of software systems, but also systems with a hardware component (eg ML training clusters) or a facility component (eg datacenters). The specifics of your system absolutely dominate any general design guidance. As the manager of a team that publishes reference architectures, I do think that it’s helpful to clearly understand where your specific design differs from generic advice. If you’re going off the beaten path, you should know you’re doing that! And be able to plan for any additional validation involved in doing that. But relatedly, this is part of why I think that any generic advice should be based on some actually existing system. If you are telling someone they should follow a given principle, you should be able to point to an implementation that does follow that principle. Or else you’re just speculating into the void. Which admittedly can be fun but is not nearly as valuable as speaking from experience.",
            "content_html": "<p>In <em><a href=\"https://www.seangoedecke.com/you-cant-design-software-you-dont-work-on/\">You can&#8217;t design software you don&#8217;t work on</a></em>, Sean Goedecke discusses why generic advice on the design of software systems is often unhelpful.</p><blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\"><p><strong>When you’re doing real work, concrete factors dominate generic factors</strong>. Having a clear understanding of what the code looks like right now is far, far more important than having a good grasp on general design patterns or principles.</p></blockquote><p>This tracks with my experience not just of software systems, but also systems with a hardware component (eg ML training clusters) or a facility component (eg datacenters). The specifics of your system absolutely dominate any general design guidance.</p><p>As the manager of a team that publishes reference architectures, I do think that it’s helpful to clearly understand where your specific design differs from generic advice. If you’re going off the beaten path, you should <em>know</em> you’re doing that! And be able to plan for any additional validation involved in doing that.</p><p>But relatedly, this is part of why I think that any generic advice should be based on some actually existing system. If you are telling someone they should follow a given principle, you should be able to point to an implementation that <em>does</em> follow that principle. </p><p>Or else you’re just speculating into the void. Which admittedly can be <em>fun</em> but is not nearly as valuable as speaking from experience.</p>",
            "url": "https://hpc.social/personal-blog/2025/why-generic-software-design-advice-is-often-useless/",
            
            
            
            
            
            "date_published": "2025-12-30T03:33:15-07:00",
            "date_modified": "2025-12-30T03:33:15-07:00",
            
                "author": "Thinking Out Loud"
            
        },
    
        {
            "id": "https://hpc.social/personal-blog/2025/large-software-systems/",
            "title": "Large software systems",
            "summary": null,
            "content_text": "In Nobody understands how large software products work, Sean Goedecke makes a number of good points about how difficult it is to really grasp large software systems. In particular, some features impact every part of the system in unforeseen ways: Why are these features complicated? Because they affect every single other feature you build. If you add organizations and policy controls, you must build a policy control for every new feature you add. If you localize your product, you must include translations for every new feature. And so on. Eventually you’re in a position where you’re trying to figure out whether a self-hosted enterprise customer in the EU is entitled to access a particular feature, and nobody knows – you have to go and read through the code or do some experimenting to figure it out. Sean also points out that eventually the code itself has to be the source of truth, and debugging requires deep investigation of the continually-changing system. I’ve seen this happen in a bunch of different orgs, and it does seem to be true, especially for products with a large number of collaborating teams. I would add that in addition to the code itself, you often need to have conversations with the relevant teams to discern intent and history. Documentation only goes so far; eventually you need to talk to people.",
            "content_html": "<p>In <em><a href=\"https://seangoedecke.com/nobody-knows-how-software-products-work/\">Nobody understands how large software products work</a></em>, Sean Goedecke makes a number of good points about how difficult it is to really grasp large software systems.</p><p>In particular, some features impact every part of the system in unforeseen ways:</p><blockquote class=\"wp-block-quote is-layout-flow wp-block-quote-is-layout-flow\"><p>Why are these features complicated? Because&nbsp;<strong>they affect every single other feature you build</strong>. If you add organizations and policy controls, you must build a policy control for every new feature you add. If you localize your product, you must include translations for every new feature. And so on. Eventually you’re in a position where you’re trying to figure out whether a self-hosted enterprise customer in the EU is entitled to access a particular feature, and&nbsp;<em>nobody knows</em>&nbsp;&#8211; you have to go and read through the code or do some experimenting to figure it out.</p></blockquote><p>Sean also points out that eventually the code itself has to be the source of truth, and debugging requires deep investigation of the continually-changing system.</p><p>I’ve seen this happen in a bunch of different orgs, and it does seem to be true, especially for products with a large number of collaborating teams. I would add that in addition to the code itself, you often need to have conversations with the relevant teams to discern intent and history. Documentation only goes so far; eventually you need to talk to people.</p>",
            "url": "https://hpc.social/personal-blog/2025/large-software-systems/",
            
            
            
            
            
            "date_published": "2025-12-28T19:50:27-07:00",
            "date_modified": "2025-12-28T19:50:27-07:00",
            
                "author": "Thinking Out Loud"
            
        },
    
        {
            "id": "https://hpc.social/personal-blog/2025/sc-25-recap/",
            "title": "SC'25 recap",
            "summary": null,
            "content_text": "  The annual SC conference was held last week, drawing over  16,000 registrants and 560 exhibitors  to St. Louis, Missouri to talk about high-performance computing, artificial  intelligence, infrastructure, and science. It was my tenth time attending  in-person (12th overall), and as is always the case, it was a great week to  reconnect with colleagues, hear what people are worrying about, and get a  finger on the pulse of the now-rapidly changing HPC industry.  Outside the SC'25 convention center on the only clear day of the week.  Although every SC I've attended always felt a little different from the  previous year, this one felt quite different. Part of that results from my own  personal circumstances: this is the first year I attended as an employee of  VAST Data, and so the people with whom I met and the technical problems to  which I paid attention were certainly biased towards those most relevant to my  work. But the backdrop of the whole conference has also shifted. It's been  three SC conferences since ChatGPT came out, and it's now undeniable that AI  isn't simply on the horizon; it's shaping the field of HPC and scientific  computing. What used to be an argument of \"us vs. them\" is now more like \"them (and us?).\"  As has become tradition, I'm sharing some of my thoughts from the week with  the world in the hopes that someone finds this interesting and insightful.  
I've roughly organized them into two areas: big themes and the exhibition hall.  Big themes; Theme 1: The big number is losing its shine; Top500; The Gordon Bell Prize; Fixing problems caused by the big number; Theme 2: HPC policy is becoming AI policy; Theme 3: AI discourse is growing up; Agentic workflows; Data and agent-centric service infrastructure; The exhibit hall; By the numbers; Interesting new technology; Dell IR700; HPE Cray GX5000.  Big themes  HPC has always been at the center of a tension between keeping things the same  (supercomputers are the most stable the day they are turned off) and pushing  the technological envelope (which is the fastest way to unlock new discovery).  The desire to push the envelope has always been a \"pull\" towards the future;  researchers first lead with kooky ideas (like DAOS and Kokkos), and as those  ideas turn from research into production, they make new technologies (like  all-flash and AMD GPUs) accessible to scientists.  What hasn't historically happened, though, is a strong \"push\" towards the  future. Scientific HPC centers push themselves to justify building the next  big supercomputer, but it's been a given that there will always be another big  machine, so this push has been internal and gentle. Combined with the  not-so-urgent pull of HPC researchers, every center has gotten a new machine  every five years or so.  This is the year where it became clear to me that AI is now exerting a strong  push on the HPC industry--a shove even, forcing HPC centers around the world  to align themselves on an AI mission if they want to survive. All the  big-money HPC systems being announced this year are clearly being positioned  as AI-first and AI-motivated, and these announcements are going well beyond  simply peppering \"AI\" throughout the press release and otherwise acting as if  it was business-as-usual. 
This is the first SC where I saw scientists,  architects, and decision-makers being forced to confront real tradeoffs  that favor either HPC or AI, and they are beginning to choose AI.  This push-and-pull on HPC towards the future manifested in three big themes.  Theme 1: The big number is losing its shine  HPC has long organized itself around treating the big machine and the big  number as its top priority, and this is why the two largest HPC conferences of  the year honor the semiannual release of the Top500 list on their main stage.  However, this year felt like the first time that no single number (that somehow  reflects \"performance\") dominated the conversation. Instead, the discourse was  more diffuse and discussed \"performance and x\" or \"the supercomputer and x.\"  Top500  The place where this was most evident to me was at the  Top500 BOF, where the latest list was unveiled.  The biggest announcement was that Europe now has its first benchmark-confirmed  exascale system in JUPITER, which ran a full-system HPL at  1,000,184 TFLOPS  for two hours and seven minutes. However, JUPITER didn't get any stage time at  the BOF since, like Aurora, it actually debuted on a previous list with a  sub-exascale run. This run pushed it over the exascale finish line, but  if the Top500 list metadata is to be believed, the run used 100% of JUPITER's  5,884 nodes to break the barrier--a feat that is unlikely to be reproduced on  any production applications, since it is rare to have zero failed nodes in any  large-scale production environment.  So, while there was little fanfare for Europe in breaking the exaflops barrier  with its new big machine and big number, there were some big  announcements--one overt, and others more muted.  The biggest news was that the Top500 list is changing hands.  
Whereas it has historically been controlled by three people--Jack Dongarra,  Horst Simon, and Erich Strohmaier--it will be transitioning to be  community-controlled under the stewardship of ACM SIGHPC. Dongarra, Simon, and  Strohmaier will still be on the steering committee under the ACM stewardship,  but this new governance structure opens the doors for new ideas to breathe new  life into the way systems are ranked and, more broadly, how \"performance\" is  meant to be interpreted from Rmax.  At present, the list (and related lists) are bound by rules that, in the  present day of reduced-precision accelerators, make little sense. For example,  using the Ozaki scheme within the LU decomposition is not allowed by Top500  despite the fact that it can produce the same answer with the same numerical  accuracy much faster than hardware FP64. And while the HPL-MxP benchmark does  allow solving the same problem using more creative methods, Strohmaier  highlighted a problem there too: it never dictated how to deal with multiple  levels of mixed precision until AIST broke the rankings. AIST ran HPL-MxP at  both 16-bit and 8-bit precisions, resulting in their ABCI 3.0 system  simultaneously ranking at #6 and #10.  These sorts of issues make it easy to question the value of leaderboards like  Top500 or HPL-MxP, as their definition of \"performance\" becomes increasingly  further divorced from how large supercomputers are really used. The past few  years have shown that there hasn't been the time or energy to get ahead of  these ambiguities amongst the three men maintaining the list, so transitioning  it to ACM will hopefully be a positive move that will give the list a chance  to be revitalized.  
To their credit,  the incipient stagnation of the Top500 list was called out by  Strohmaier during his analysis of the list, acknowledging that \"growth has  tremendously slowed down compared to what it used to be\" and \"we don't have  proof of what is actually the reason for that:\"All the key highlights of this SC's Top500 list.China has stopped submitting, the AI and hyperscale providers really never  started submitting, and retired systems are being thrown off the list long  before they fall off the bottom. To me, this was a tacit acknowledgment that  the list does not have a bright future out to 2030 unless it is modernized to  be relevant to the way in which today's largest systems are actually being  used--which is not DGEMM.  The final surprising acknowledgment during Strohmaier's talk was that  the list is trailing the state of the art in hardware by  quite a bit. He pointed out that Blackwell systems are only now starting to  appear even though they've been shipping in volume for the better part of a  year. While he hypothesized that there is \"uneasiness\" about Blackwell in an  HPC context, the reality is that there are no Blackwells for HPC until the  Blackwell orders for hyperscale AI have been fulfilled. HPC is second in line,  and even then, the only Blackwells I could find on this year's Top500 list  were NVL8 configurations--not the NVL72 configurations that have been filling  up hyperscale datacenters like  Fairwater.  Strohmaier pointed out that Blackwell, by virtue of its HBM3e (vs. Hopper's  HBM3), is showing up higher on the HPCG list (which is a memory bandwidth  test) than on Top500 (which is an FP64 FLOPS test). He phrased this as  evidence that \"not everything is bad for the HPC community,\" but I would have  phrased my conclusion a little differently:    Blackwell is actually great for HPC, because most real workloads are    memory-bandwidth bound, not FLOPS bound. 
The fact that B200 offers similar    FP64 FLOPS at higher memory bandwidth means that real applications will get    higher effective use of those FP64 FLOPS.      Despite the above, Blackwell doesn't perform well on Top500 because HPL    doesn't reflect the reality that memory bandwidth is important. It follows    that HPL doesn't reflect the reality of real HPC applications. A Blackwell    system can be significantly better for real HPC applications than a    comparably sized Hopper system even though it may rank lower than Hopper on    Top500.      Blackwell isn't showing up in volume now because the HPC community is second    in line. The HPC community isn't uneasy as much as it is completely locked    out. The first NVIDIA-based exascale system debuted in November 2025 despite    its GPU being three years old, suggesting that if big Blackwell systems ever    appear on Top500, it'll happen in 2026-2027.    All of this is a roundabout way of showing that the big number--in this case,  the HPL score--no longer leads meaningful conversation around how useful a  system is for science.The Gordon Bell Prize  Another major indicator of the changing tide away from the big number was the  work that won this year's  Gordon Bell Prize. The winning  paper, titled \"Real-time Bayesian inference at extreme scale: A digital twin for tsunami    early warning applied to the Cascadia subduction zone,\" wasn't the typical case of running a huge simulation for a few hours and  reporting some result. Rather, it described a four-step workflow that  culminates in the desired insight popping out of a computation that runs  across only 128 nodes and completes in less than 0.2 seconds. Furthermore, the  hero run part could be decomposed into trivially parallel components, allowing  the bulk of the computation to be geographically distributed across HPC  centers or GPUs spread across on-prem and cloud providers.  
My understanding of the work is that there was a massive \"offline\" computation  to precompute a few key matrices (Phase 1) followed by two shorter offline  steps that turn those matrices into the core of the digital twin. The last  step, which was \"online\" and designed to be computed in real-time, could then  take this core and solve the input problem with extremely low latency. This  workflow front-loads a hero run in such a way that, if an earthquake were to  occur, the risk of tsunami could be calculated in less than a second using  only modest compute resources and the precomputed core.  The authors eschewed methods that generated tons of FLOPS in favor of methods  that were less FLOPS-efficient but got to the answer faster. In the authors'  own words:    As shown in Fig. 7, higher FLOP/s does not necessarily lead to faster time-to-solution. On    MI300A nodes of El Capitan, the best-performing    implementation, Fused PA, achieves a lower percentage (5.2%) of theoretical    peak FLOP/s than Fused MF (5.5%) but is faster.    Interestingly, the hero computation here was embarrassingly parallel(ish) as  well; in their demonstration run, the hero run (Phase 1) was broken into 621  independent calculations each requiring 128 nodes (512 A100 GPUs) for about an  hour. Because they are independent, these tasks could be parallelized across  multiple HPC centers as well, and my understanding is that the data volumes  involved are modest; Phase 1 would require a single shared copy of the input  mesh (a hundred GiB?) per HPC center, and each of the 621 tasks would output  around 8 GiB which would have to be copied back.  While I don't understand the mathematics behind this work, the paper took what  would've been a huge exascale-class mathematical problem (\"10 years on a  sustained 1 EFLOP/s machine\") and reformulated it into a workflow that solves  the problem faster and more usefully. 
Instead of brute-forcing the problem  with a big supercomputer, they split it into separate offline and online  parts, and this naturally allowed the most computationally expensive part to  be geographically distributable.  This work surrendered the need for a single big machine, and it didn't produce  a big-number result. But it did win the Gordon Bell Prize, again signaling  that the HPC community is beginning to look beyond performance-only and think  about awarding innovation according to outcomes, not just FLOPS.  The talk for this paper can be viewed  here in the SC25 Digital Experience.  Fixing problems caused by the big number  Most of my perception around the HPC community beginning to de-emphasize the  singular big machine or big number arose from organic interactions I had with  colleagues and customers though. It's hard to summarize how these  conversations went, but the  Lustre Community BoF  is a good example of what I saw elsewhere.  Lustre has long been the gold standard in high-performance parallel I/O in the  HPC community because it was designed from day one to deliver high bandwidth  above all else. As a result, Lustre already has the big number solved in many  ways, and events like the Lustre BOF are a great case study in what it looks  like for a performance-first technology to be pushed into adapting to deliver  more than just a big number.  First, the ever-innovative Stéphane Thiell from Stanford discussed the process  and tooling he developed to enable online capacity expansion of a Lustre file  system. The basis for it was a distributed, fault-tolerant tool he developed  that uses redis, lfs find, and lfs migrate to manage the state of file  migrations across Lustre servers as the file system is rebalanced. While a  part of me thought this was a great tool that would be super helpful for many  others, another part of me was kind of horrified.  
Maybe I've been spoiled by working in hyperscale and AI these past three  years, but online capacity expansion and rebalancing is a built-in capability  of all distributed storage systems these days. All the major cloud object  stores do this, as do all modern parallel file systems including Quobyte,  VAST, and WEKA. Of course, none of these modern systems are as efficient (on a  per-CPU core or per-SSD basis) as Lustre at delivering peak performance. But  Stéphane's talk made me realize the price that's paid for this great  performance.  Andreas Dilger and others went on to talk about Lustre futures, and as they  were speaking, I noticed that nobody was talking about performance  improvements to Lustre. Rather, feature development was focused on catching up  in every other dimension--data governance, reliability, manageability, and  others. For example, Andreas talked a bit about the upcoming \"multi-tenancy\"  features coming to Lustre:  It's a lot of work to retrofit multitenancy into a performance-first file system.  I put “multi-tenancy” in quotes because these changes really represent trying to back into a security posture that is fundamentally different from the one that Lustre was designed around. In the pursuit of performance, Lustre (as with most other HPC technologies) was designed assuming that security was someone else’s problem. By the time someone could log into a system that could mount a Lustre file system, they had already been authenticated, and it was up to the OS on each compute node to authorize any interactions with Lustre itself. This is the “implicit trust” model.  The problem, of course, is that the rest of the world has adopted a \"zero  trust\" model which makes many things (except performance!) generally easier.  Compliance is easier when the system assumes that everything is encrypted as a  default and key management can be delegated to a third party. 
Because Lustre  didn't do this from the outset, it is going through this process of  retrofitting encryption in various places and using a mixture of nodemaps,  UID/GID maps, and shared secrets to patch over all the places where trust was  fundamentally implicit.  Later on in the BOF, panelists acknowledged (some half-heartedly) that  manageability of Lustre was a barrier. One panelist admitted that it took five  years of work to almost get to the point where a Lustre update can be done  without crashing applications. Another panelist said that multitenancy in  Lustre is easy if you follow a million steps, and that his company  was developing script-based ways to simplify this. While the idea of using  scripts to simplify operations is not bad, from a secure supply chain  standpoint, relying on third-party bash scripts to enable features required  for legal compliance is horrifying.  I don't mean to pick on Lustre alone here; other HPC technologies such as  InfiniBand, Slurm, and DAOS are facing the same reality: architectures that  prioritized performance and scalability over everything else are now going  through similar contortions to meet modern requirements around security,  manageability, and data governance. For those HPC centers who do not have to worry about compliance  (which is most of open-science computing), these technologies will continue to  be just fine.  However, the  successes of these modern file systems across leading HPC centers  and the proliferation of alternative technologies such as  Kubernetes-based HPC and  MRC over Ethernet  tell me that HPC is coming around to the idea that marginal increases in  performance are no longer worth missing out on factors that weigh heavily on  day-to-day operations like manageability, reliability, and flexibility.  
Theme 2: HPC policy is becoming AI policy  Some of the biggest news at SC was not actually showcased at the conference  despite being what many people wanted to talk about in side conversations: HPC  policy is rapidly becoming AI policy, resulting in a slew of huge (but poorly  defined) \"public-private partnerships.\"  As a bit of background, the Oak Ridge Leadership Computing facility announced  its next system, Discovery, in late October--this was the result of a  \"typical\" supercomputer procurement process that  first came into the public eye in 2023. However, the Discovery announcement also included mention of a smaller  system, Lux, which will \"leverage the Oracle Cloud Infrastructure (OCI)\" (whatever that means) to provide earlier access to AMD MI355X GPUs ahead of  Discovery's full-scale deployment.  Then, two days later, Argonne National Laboratory announced a  similar arrangement with Oracle Cloud and NVIDIA  to deliver a small (Lux-sized) GPU supercomputer named Equinox, followed by a  much-larger 100,000-GPU supercomputer named Solstice. Neither Equinox nor  Solstice are attached to a \"typical\" supercomputer procurement; the follow-on  to Aurora, to be named  Helios, is  still in planning  and will be deployed in 2028. This strongly suggests that, whatever  \"public-private partnership\" means to the DOE, it is not the same as the  typical leadership computing systems; it is its own AI-centric program.  At SC itself, Evangelos Floros (EuroHPC's head of infrastructure) also  mentioned the \"need for public-private partnerships\" to realize EuroHPC's goal  of building \"AI Gigafactories\" with \"100,000 advanced AI processors\" across  Europe.\"Need for public-private partnerships\" to fund AI factories is recognized by EuroHPC too.Again, what exactly this \"public-private partnership\" model entails in Europe  was never really defined.  
What was clear is that both American and European efforts are declaring the  need to build massive (100K+ GPU) supercomputers for AI, the traditional HPC  centers will be the public stewards of them, and \"public-private partnerships\"  are the only way to realize them since governments alone cannot foot the bill.  The Top500 BOF also included a short, awkward talk by Rick Stevens titled \"The  DOE AI Initiatives\" that amounted to Stevens saying he had nothing to say.  What really happened, I suspect, is that DOE's new \"Genesis Mission,\" which was announced the week after the SC conference, was a week  late and therefore couldn't be discussed as originally planned. If Stevens had  been able to describe the Genesis Mission, though, I'm sure he would've also  described \"public-private partnership\" as a key aspect, since the same  language is used in the  Executive Order that established Genesis. And I'm sure his description would've been no clearer about what this  really means than what EuroHPC or the OCI/DOE descriptions have stated.  Most revealing was my observation that, even outside of the proper conference  program, nobody really knew what any of this meant. I talked to plenty of my  colleagues from both government HPC and hyperscale cloud organizations, and  the only consistent message was that there aren't many concrete facts backing  up the press releases right now. It appears that these partnerships were  brokered far outside the usual channels through which large supercomputer procurements  are normally done, and the people in charge of actually delivering on the  promises of the press releases are still figuring out what is possible.  Connecting the dots between Lux/Equinox/Solstice, Genesis, and a recent  RFI  and  RFP  from DOE to allow  hyperscalers to build AI factories on federal land, it appears that what is happening is...    
The DOE has a bunch of land that is adjacent to the National Labs that is    undeveloped but has the infrastructure to support massive AI factories.    Specifically named is a 110-acre parcel at Argonne that can accommodate up    to 1 GW \"AI data park,\" and a 100-acre parcel at Oak Ridge with up to 800    MW. These details were disclosed in    an RFI they issued earlier in the spring.      The    Solstice press release    specifically said that DOE envisions \"shared investments and shared    computing power between government and industry.\" Given the RFI/RFP were    about land leases, these public-private partnerships may involve splitting    the costs of space/power/cooling (the land and infrastructure being leased)    and the capital/operations (the supercomputer cloud services being built)    between the Labs and Oracle.    A potential model for operations is that cloud providers are allowed to build  and operate commercial AI cloud services adjacent to the DOE HPC facilities in  exchange for the DOE Genesis Mission being entitled to some of those AI cloud  capabilities. Exactly how much supercomputing resources hyperscalers like OCI  would give to DOE, and exactly how much it would cost the DOE Labs to serve as  landlords, is probably still undefined. But seeing as how power is the single  biggest limiter in AI these days, I expect this model will only spread costs  around, not actually lower them.  If this is indeed how Genesis plays out, this would establish a bizarre new  way for the government to acquire HPC (or AI) capabilities that completely  sidesteps the standard procurement model. Instead of plunking down a hundred  million dollars a year to finance a new leadership supercomputer, we might be  moving into a world where the Labs plunk down a hundred million dollars a year  to cover the costs of power, space, and cooling for a cloud provider. 
And  instead of owning a leadership supercomputer, these national HPC facilities  wind up consuming HPC (well, AI) resources from cloud providers--hopefully at  a cost that reflects the fact that the cloud providers are also profiting from  cycles being sold off of these machines to commercial AI customers.  But again, this is all speculation based on the consistencies I heard  throughout the conference and the experience I had trying to build these sorts  of partnerships with the HPC community while I worked at Microsoft. I may be  right, or I may be wildly wrong. There are probably only a handful of people  in the world with a clear idea of what these partnerships are meant to look  like right now, and they are all way above the heads of the people at the HPC  centers who will be tasked with executing on the vision.  Selfishly, I am also left with a bit of heartburn over all of this news. I put  a lot of personal time and energy into giving the HPC community the  information it needed to feel comfortable about partnering with hyperscale AI  infrastructure providers while I was at Microsoft, and it often felt like a  Sisyphean task. Within months of my giving up and moving on from my career at  a cloud provider, seeing a complete reversal of policy from the leadership HPC  folks--and seeing the \"other guy\" in pole position--is a bit of a slap in the  face.  I also couldn't help but notice that the cloud provider in all the headlines  in the US didn't seem to demonstrate a very strong and unified presence at SC  this year. Comically, they didn't even use their own brand's colors for their  booth on the exhibit floor. And the color scheme they did use left no room for  Oak Ridge's Lux system, which will be AMD-based, to be showcased.  Oracle's booth at SC25. Their brand color is red, not green. 
Or so I thought.Though I may have read too much into this, it feels like these public-private  partnerships are not necessarily composed of equal partners with equal levels  of commitment.  More broadly, I left the conference concerned that the discourse happening  around these cloud-HPC/AI integrations--at least in the US--appears to have  regressed compared to where it was when I worked at Microsoft. Many of the  things we had to figure out years ago (cybersecurity models, impacts on jobs  at the HPC centers) seem to have reset to zero. And sidestepping the  procurement processes for leadership computing to enable these public-private  partnerships will either require significant new funding (of which Genesis  provides none; the executive order as-written appears to recolor existing  money) or robbing Peter (the budget funding the next generation of leadership  HPCs) to pay Paul (the cloud providers serving up compute resources for AI).  As a result, I can envision a future where all of the money that used to fund  leadership computing for science becomes money to fund commercial AI  factories, resulting in a slow evaporation of the LCFs as their HPC  capabilities shrink in size and relevance.  Though there's lots more to be said on this topic, it's all based on  conjecture. So, maybe the best thing to do is quietly wait and see.  Theme 3: AI discourse is growing up  This was the first SC where it felt like the discourse around AI's role in the  future of scientific computing actually carried some substance. Whereas  previous years saw talk that mostly revolved around basic ideas like \"do LLMs  hallucinate too much?\" or \"can ChatGPT write MPI code?,\" I sat in on a number  of interesting talks and conversations that skipped the question of \"is AI  useful?\" and went straight to \"this is how AI is proving useful to us.\"  Maybe it's related to the previous theme: HPC money is becoming AI money, so  AI research is becoming required to stay afloat. 
Or maybe it's because 2025  has been the year of agentic AI, and agents allow LLMs to be integrated much  more surgically into complex workflows. Or maybe confirmation bias led me to  sit in sessions and talk with people who are at the frontier of applying AI to  scientific discovery. Whatever the case may be, I was glad to hear so much  discussion from researchers around the importance of all the connective tissue  required to operationalize AI in scientific computing.Agentic workflows  A great example of this was the  1st International Symposium on Artificial Intelligence and Extreme-Scale    Workflows, which happened on Friday. One of the invited speakers, Dr. Katrin Heitmann,  connected a lot of dots in my head with a talk she gave on how massive-scale,  physics-based simulation workflows can benefit from agentic AI.Heitmann's vision on how agentic approaches can augment (but not replace) humans in complex scientific workflows.The crux of the challenge faced by most massive-scale simulation (like  HACC, the cosmology code  for which she is famous) is that they generate massive amounts of data. The  most recent HACC run  generated hundreds of terabytes of compressed data per checkpoint and over a  hundred petabytes of data in the end; this cosmological simulation serves as a  reference dataset from which downstream cosmological research can draw when  exploring targeted questions. The challenge, of course, is finding relevant  pieces of the simulated universe from amidst a hundred petabytes of raw data.  Dr. Heitmann's premise is that agents and tools have very specific scopes and  capabilities, and researchers have control over which of these tools they wish  to use. However, they can hand off these tools to an agentic workflow to let  it autonomously sift through all of the data, looking for specific features  within the simulated universe that are relevant. 
A specific example she gave  was the process of examining 500 million galaxy clusters; with an agentic,  AI-driven approach, a postdoc was able to interactively sift through these  objects without examining each one individually. For truly interesting  objects, a separate agent could go search the literature and provide an  explanation as to why it may be interesting, absolving the postdoc from having  to make round trips between the dataset and external literature.  That all said, it was clear from this talk (and others) that integrating  agentic AI into scientific inquiry is still in its early days. But what I  appreciated about this talk (and the entire workshop) is that it sidestepped  pedestrian questions about trustworthiness by acknowledging that the goal  isn't full autonomy, but rather, enabling researchers to do things faster.  There is still a human at the start and the end of the workflow just as there  always has been, but agents can reduce the number of times a human must be in  the loop.  Data and agent-centric service infrastructure  Even when AI wasn't the main topic of discussion, it was clear to me at this  SC that AI is influencing the way researchers are thinking about the  infrastructure surrounding supercomputers. A great example of this was the  keynote at the PDSW workshop,  given by the ever-insightful  Dr. Rob Ross, where he offered a retrospective on the work his team has    done over the last two decades, what he felt they got right, what they missed, and what's ahead.  
Towards the end of his presentation, he made the case that \"science is  increasingly multi-modal.\" But rather than talk about multimodality in the AI  sense, he was emphasizing that there's more to scientific computing than  performance:  Domain science, provenance, search, and resilience are equal partners to performance in scientific computing.  Taken at face value, this slide positions performance on equal footing with domain science, provenance, and findability, and his argument was that we’ve moved beyond the world where the only storage problem that HPC faces is checkpointing. Just as Dr. Heitmann would say on Friday, Dr. Ross’ argument was that the increasing volume of scientific data coming out of both exascale simulation and scientific instruments is driving the field towards more automation. And with automation comes a greater need to understand data provenance--after all, if automation produces a surprising result, a human ultimately has to go back and understand exactly how the automation generated that result.  He also pointed out that in this coming world of automation-by-necessity,  infrastructure itself might have to be rethought. After all, traditional  technologies like parallel file systems were designed to make the lives of  human researchers easier; when the primary consumer of data becomes AI agents,  not humans, there may be better ways to organize and expose data than through  files and directories. A human might repeatedly cd and ls to find a specific  dataset on a file system, whereas an agent could query a flat index to find  the same data in a single step.  At the end of the same PDSW workshop, I was fortunate enough to contribute to  a panel  where many of these same themes--how will data systems change as AI plays a  greater role in scientific discovery--were discussed. 
Although we touched on a  lot of topics, what stuck with me was a general acknowledgment that, while HPC  has always talked about data management and provenance as being important,  they were always treated as a \"nice to have\" rather than a \"must have.\"  However, as was echoed across many presentations (including the two I  described above), governance and provenance are now becoming non-negotiable as  larger datasets drive us towards AI-driven automation.  Regardless of what you think about AI's ability to accelerate scientific  discovery, I left SC with the feeling that AI is forcing the HPC community to  grow up with regard to how seriously it takes data management:    The size and velocity of datasets generated by simulation or experiment are    growing beyond any single person's ability to analyze them by hand. The    complexity of these data is also making it harder to develop    heuristics-based or analytical approaches to combing through all of it.      The best path forward to understanding these data is through AI (via    purpose-built models for analysis) or AI-driven data exploration (via    autonomous, agentic workflows).      Automation or autonomous workflows will always act under authority delegated    to them by human researchers, meaning there is a growing need to be able to    inspect how these workflows arrived at the conclusions they generate.      Understanding how an answer was achieved requires significantly better data    management features such as governance, provenance, and auditability. A    result is ultimately only useful if a human can trust it, and that trust    comes from understanding which data informed that conclusion, how that data    was created, and how it was modified over time.    Put differently, checkpointing was the main concern of I/O research because  I/O performance was the first scalability issue that scientific computing ran  into as supercomputers and scientific instruments got bigger. 
However, we're  now at a point where issues ancillary to performance have reached the limits  of scalability. Dr. Ross's multi-modal slide indicates that provenance,  indices/search, and resilience are some examples of these new barriers, but  there are plenty more as well.  In a sense, this theme is the opposite side of the same coin as the first  theme I discussed--that the big number is losing its shine. The hardest  questions going forward aren't the obvious ones about scaling performance;  they are about scaling everything else to keep up. AI seems to be the  technology that has cleared a path to these data management hurdles, but the  benefits of adopting strong data management practices and systems will extend  far beyond the reach of just enabling AI-based automation.  The exhibit hall  The exhibit hall has long been one of my favorite parts of attending SC  because it's a great way to get a feeling for what technologies and vendors  are hot, where the innovation is trending, and what sorts of commercial  problems are worth solving. Every year I feel like I have less and less time  to walk the exhibit hall though, and the layout and composition of this year's  exhibition meant I only saw a small fraction of what I wanted to see in the  few days it was up.  The most common comment I heard about the exhibit this year is captured in  Doug Eadline's article,  SC25 Observations: More Pumps than Processors  (which is well worth the read!). The same commentary was repeated throughout  the OCP conference in October as well, suggesting that there is a lot of money  to be made (or at least the prospect of money) in helping datacenters get  outfitted for the liquid cooling demanded by the next generation of  large-scale GPU infrastructure. However, I found the overwhelming amount of  space devoted to liquid cooling companies acutely problematic at SC25 this  year for two reasons:  Most SC attendees have nothing to do with liquid cooling. 
A    colleague of mine who operates supercomputers for the energy sector asked    one of these big liquid cooling vendors what he could do to actually engage    with them. After all, he doesn't buy liquid cooling infrastructure; he buys    whole supercomputers that come with heat exchangers and CDUs that are    integrated into the solution. The vendor had no good answer, because the    reality is that the typical supercomputer user or buyer has no say over what    piping, coolant, or exchangers are used inside the machine itself. The whole    point of buying an integrated supercomputer is to not have to deal with that    level of detail.  These liquid cooling vendors soaked up a ton of floor space. A few of these physical infrastructure providers had massive (50x50)    booths sprinkled across the exhibit hall. This, combined with the fact that the    average SC attendee has nothing to do with liquid cooling, meant that the    booths that were more likely to be relevant to a typical attendee were much    further apart than they had to be.    The end result was that the exhibit hall was absolutely gargantuan and yet  information-sparse. In fact, this year saw a secondary exhibit hall in the old  football stadium serve as overflow space, because the entire primary exhibit  hall was full. What's worse is that this overflow space was (as best as I  could tell) completely disconnected from the main hall, and the only time I  ever saw it was from the dining area used to serve lunch for the tutorials.  The exhibit hall's overflow space being set up in the former football stadium.  I would’ve been furious if I had been stuck with a booth in this overflow space, because I can’t imagine the foot traffic in there was very high. I personally couldn’t even find the entrance to this second exhibition area in the few hours I had to look for it.  
I can't help but think the SC organizers leaned far too much into booking up  as much space (and therefore exhibitor dollars) as possible without thinking  about the dilutive effects of having such a massive vendor count. Some vendors  definitely benefitted from having a good location near one of the hall  entrances, but I also heard a nontrivial amount of grumbling around how little  traffic there was at some of the big booths. It wouldn't surprise me if there  was a contraction of the HPC mainstays at SC26.By the numbers  Rather than rely solely on anecdotes though, it's also fun to take a  quantitative look at the changes in exhibitors relative to last year. Since I  spent the time figuring out how to generate tree maps for my SC24 recap last  year, I figured I should re-run the same analysis to compare SC25 to SC24.  Of the biggest booths who were exhibiting for the first time this year, it  should be no surprise that the two biggest new entrants were Danfoss (liquid  cooling infrastructure) and Mitsubishi Heavy Industries (gas turbines and  other large-scale infrastructure): New exhibitors with the largest booths.Of the other top new exhibitors, some (Solidigm, Sandisk, C-DAC, MinIO, and  University of Missouri Quantum Innovation Center) were quite relevant to the  typical SC attendee. Arm was also back after having skipped SC24. But there  were scores of new exhibitors whose services and products seem much more  relevant to very niche aspects of physical datacenter infrastructure.  Of the exhibitors who didn't show up to SC25 but had big booths at SC24, there  was a diverse mix of markets:Vendors who didn't show up to SC'25 but had big booths at SC'24.Sadly, higher ed and government popped up on this list (see  Doug Eadline's take on this for more). 
A bunch of datacenter infrastructure providers also vanished, including  Valvoline and Boundary Electric; this suggests that some of the top new  vendors of this year (Danfoss, Mitsubishi) may similarly vanish entirely next  year after realizing that SC isn't really their crowd. But I was also  surprised to see some big names in AI vanish; Iris Energy (IREN) is a GPU  cloud provider that just inked a multi-billion dollar deal with Microsoft;  Ingrasys manufactures much of the world's GB200 NVL72 infrastructure; Groq  and SambaNova also inexplicably vanished.  Perhaps more interesting are the top growers; these vendors exhibited both  last year and this year, but went significantly larger on their booth sizes:  Biggest increases in booth size at SC'25 vs. SC'24.  Legrand, which provides datacenter infrastructure bits, likely grew as a  result of it acquiring USystems and merging USystems' booth with Legrand's  booth this year. The other big booth expansions are mostly household names  though; Gates, EBARA, and GRC are cooling vendors that the typical SC attendee  can't do much with, but the others are organizations with whom a researcher or  HPC datacenter operator might actually talk.  Finally, the top contractions in booth space are a mix of service providers,  HPC facilities or research centers, and component suppliers:  Biggest decreases in booth size at SC'25 vs. SC'24.  Of the biggest vendors who downsized, Carahsoft is a component reseller and  service provider, Stulz is a liquid cooling company, HLRS is a German  supercomputer center, and Viridien is an HPC services company that primarily  serves the energy sector. It is surprising to see AWS shrink while Microsoft  grew, and it is doubly surprising to see Oracle shrink when it's at the center  of the biggest HPC deployment news of the season. 
Given that these booth sizes  are chosen a year in advance, this may speak to how unexpected the turn of  events was that resulted in Oracle carrying the cloud services end of DOE's  big public-private partnerships.  Interesting new technology  For reasons I'll discuss later, I didn't have much time to walk the exhibit  hall floor. Combined with the fact that everything was so spread out and  diffuse, I just didn't get a great sense of what interesting new technology  was being introduced this year beyond what tended to stick out. And amidst all  the giant CDUs and liquid cooling infrastructure, it was hard for anything to  stick out except really big compute cabinets.  Dell IR7000  Dell's booth had a fully loaded IR7000 rack on display (as they did at  GTC earlier in the year) with 36 GB200 NVL4 sleds. At 50OU high (almost eight feet tall), this thing  is physically huge:  Dell's 50OU IR7000 rack, fully loaded. This is what TACC Horizon will be built from.  Unlike the version they had on display at GTC though, this one had both the  front door and a full rear-door heat exchanger installed:  HUGE rear-door heat exchanger on the back of the Dell IR7000 rack.  What's notable about this platform is that we now know that it is the basis  for both  TACC's upcoming Horizon system  (which will have  28 of these fully loaded racks) and  NERSC's upcoming Doudna system  (which will have Vera Rubin rather than Blackwell). This rack was nominally  designed for hyperscale AI and is the basis for Dell's GB200 NVL72 (XE9712)  deployments at places like CoreWeave and xAI, which means that it'll be  thoroughly tested at scale long before TACC or NERSC have it up and running.  
This is the opposite of what has historically happened: before AI, it was  usually government HPC that had to debug new rack-scale architectures before  industry would touch it.HPE Cray GX5000  However, government HPC will still have a chance to debug a new supercomputing  platform in the recently announced  Cray GX (formally  called \"the HPE Cray Supercomputing GX platform\"), which is the successor to  the current Cray EX platform. This is the platform that the  Discovery supercomputer  at OLCF will use, and HPE had a CPU-only blade (Cray GX250) and a rack mockup on display at SC:HPE's new GX blade form factor. This one appears to be the GX250, the 8-socket CPU-only blade.It's hard to tell the size of this blade from the photo, but if you look at  the relative size of the CPU socket and the DIMM slots, you can get a sense of  how physically massive it is--it's like a coffee table. It also isn't  perfectly rectangular; Cray decided to put this unusual protrusion on the  front of the blades which is where the four NICs and eight E1.S SSDs are  housed:A look at the side of the Cray GX blade's \"nose\" showing the side-mounted NIC ports.This nose(?) adds more surface area to the front of the rack, and it makes  more sense when you see a rack full of these nodes. HPE had a full GX5000 rack  with mocked-up cardboard nodes in their booth as well:Fully loaded GX5000 rack. The nodes were cardboard, but pretty nice cardboard.By having the NIC ports (which are Slingshot 400) face the sides of the rack  rather than stick out the front, the bend radius of all that copper doesn't  have to be quite as dramatic to route it along the sides of these node noses.  And unlike previous Cray designs, there's also no midplane or backplane that  connect the nodes in a rack to the rack-local switches; everything connects  through discrete copper or optical cables.  
At the center of the rack is a liquid-cooled switch chassis, and each rack can  support either 8-, 16-, or 32-switch configurations. Each switch is a 64-port  Slingshot 400 switch, and I think the premise is that a single GX5000 rack is  always exactly one dragonfly group. If you want a smaller group, you use a  switch chassis with fewer switches.  Interestingly, this GX will also support non-Slingshot Ethernet and XDR  InfiniBand switches. Given that both XDR InfiniBand and 800G Ethernet are  shipping today and have twice the bandwidth that Slingshot 400 will have when  it starts shipping in a year, perhaps the Slingshot 400 option is just a  stopgap until HPE's investments in Ultra Ethernet result in a product. The  lack of a network backplane in the rack also makes it easier for the rack to  accommodate the non-dragonfly topologies that would be required for InfiniBand  or Ethernet.  The rear of the rack is remarkably unremarkable in that it simply contains a  rear bus bar and the liquid cooling manifolds and mates. In this sense, the  rack looks very  OCP-like; the boring stuff is in the back, everything exciting is serviced from the  front, and the rack itself is passive plumbing. Like any OCP ORv3 rack, power  shelves slot in just as server blades do, and they use the same liquid cooling  infrastructure as the rest of the rack. They power the bus bar, and the blades  and switches draw from the same bus bar.  Compared to an ORv3 rack though, these GX racks are wider and shorter. The  width probably offers more flexibility for future NVIDIA or AMD GPU boards,  but I was surprised that Cray didn't go ultra tall like Dell's 50OU IR7000. I  was also surprised to hear that Cray is launching GX with a 400 kW cabinet  design; power appears to already be a limiting factor in the nodes launching  with GX. 
A single 400 kW GX rack can support 40 CPU-only blades (81,920 cores of Venice), 28 AMD Venice+MI430X blades (112 GPUs), or 24 NVIDIA Vera+Rubin blades (192 GPUs).  For reference, the demo GX5000 rack pictured above had only 29 blades and 16  switches. I assume that fitting 40 blades into the rack requires using the  smallest dragonfly group possible.  On the cooling front, the GX5000 rack will launch with support for the same  1.6 MW CDUs as the current Cray EX platform. I heard talk of a neat sidecar  CDU option as well, but the person with whom I spoke at the HPE booth said  that would come a little later.  Overall, I was surprised by how un-exotic the new Cray GX platform is compared  to what the AI world has been doing with ORv3 racks. The fact that Cray and  Dell's designs are more similar than different suggests that the HPC/AI world  is converging on a place where the future is uncertain, and flexibility is  more important than highly engineered racks that optimize for very specific  nodes and networks. It also suggests that the real value of buying Cray is  higher up the stack; liquid cooling, power delivery, and rack integration are  becoming commoditized thanks to AI.  I was also surprised that Cray's next-generation design is not obviously  superior to what the hyperscale community is designing. Whereas the GX rack  caps out at 400 kW, Dell's will allegedly scale up to 480 kW. That said,  today's IR7000 racks shipping for Horizon are only 215 kW (for GPU racks) and  100 kW (for CPU-only racks) according to a talk given by Dan Stanzione:  The physical configuration of TACC's upcoming Horizon supercomputer.  So until the final specifications for the Rubin GPU are released, I suspect we  won't know whether Cray still leads the pack in terms of compute density, or  if Dell made the better bet by aligning its supercomputing platform on a  standard OCP rack design.",
            "content_html": "<p>  The annual SC conference was held last week, drawing over  <a href=\"https://www.hpcwire.com/2025/11/19/sc25-observations-more-pumps-than-processors/\">16,000 registrants and 560 exhibitors</a>  to St. Louis, Missouri to talk about high-performance computing, artificial  intelligence, infrastructure, and science. It was my tenth time attending  in person (12th overall), and as is always the case, it was a great week to  reconnect with colleagues, hear what people are worrying about, and get a  finger on the pulse of the now-rapidly changing HPC industry.</p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure><figcaption class=\"image-caption\">Outside the SC'25 convention center on the only clear day of the week.</figcaption></figure></div><p>Although every SC I've attended has felt a little different from the  previous year, this one felt quite different. Part of that results from my own  personal circumstances: this is the first year I attended as an employee of  VAST Data, and so the people with whom I met and the technical problems to  which I paid attention were certainly biased towards those most relevant to my  work. But the backdrop of the whole conference has also shifted. It's been  three SC conferences since ChatGPT came out, and it's now undeniable that AI  isn't simply on the horizon; it's shaping the field of HPC and scientific  computing. What used to be an argument of \"<a href=\"https://blog.glennklockwood.com/2024/05/isc24-recap.html#section11\">us vs. them</a>\" is now more like \"them (and us?)\"<span></span></p><p></p><p>  As has become tradition, I'm sharing some of my thoughts from the week with  the world in the hopes that someone finds this interesting and insightful.  
I've roughly organized them into two areas: big themes and the exhibition hall.</p><ul style=\"text-align: left;\"><li><a href=\"https://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#big-themes\">Big themes</a><ul><li><a href=\"https://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#theme-1-the-big-number-is-losing-its-shine\">Theme 1: The big number is losing its shine</a><ul><li><a href=\"https://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#top500\">Top500</a></li><li><a href=\"https://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#the-gordon-bell-prize\">The Gordon Bell Prize</a></li><li><a href=\"https://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#fixing-problems-caused-by-the-big-number\">Fixing problems caused by the big number</a></li></ul></li><li><a href=\"https://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#theme-2-hpc-policy-is-becoming-ai-policy\">Theme 2: HPC policy is becoming AI policy</a></li><li><a href=\"https://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#theme-3-ai-discourse-is-growing-up\">Theme 3: AI discourse is growing up</a><ul><li><a href=\"https://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#agentic-workflows\">Agentic workflows</a></li><li><a href=\"https://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#data-and-agentcentric-service-infrastructure\">Data and agent-centric service infrastructure</a></li></ul></li></ul></li><li><a href=\"https://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#the-exhibit-hall\">The exhibit hall</a><ul><li><a href=\"https://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#by-the-numbers\">By the numbers</a></li><li><a href=\"https://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#interesting-new-technology\">Interesting new technology</a><ul><li><a href=\"https://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#dell-ir700\">Dell IR700</a></li><li><a 
href=\"https://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#hpe-cray-gx5000\">HPE Cray GX5000</a></li></ul></li></ul></li></ul><h2 id=\"big-themes\">Big themes</h2><p>  HPC has always been at the center of a tension between keeping things the same  (supercomputers are the most stable the day they are turned off) and pushing  the technological envelope (which is the fastest way to unlock new discovery).  The desire to push the envelope has always been a \"pull\" towards the future;  researchers first led with kooky ideas (like DAOS and Kokkos), and as those  ideas turn from research into production, they make new technologies (like  all-flash and AMD GPUs) accessible to scientists.</p><p>  What hasn't historically happened, though, is a strong \"push\" towards the  future. Scientific HPC centers push themselves to justify building the next  big supercomputer, but it's been a given that there will always be another big  machine, so this push has been internal and gentle. Combined with the  not-so-urgent pull of HPC researchers, every center has gotten a new machine  every five years or so.</p><p>  This is the year where it became clear to me that AI is now exerting a strong  push on the HPC industry--a shove even, forcing HPC centers around the world  to align themselves on an AI mission if they want to survive. All the  big-money HPC systems being announced this year are clearly being positioned  as AI-first and AI-motivated, and these announcements are going well beyond  simply peppering \"AI\" throughout the press release and otherwise acting as if  it was business-as-usual. 
This is the first SC where I saw scientists,  architects, and decision-makers being forced to confront real tradeoffs that  favor either HPC or AI, and they are beginning to choose AI.</p><p>  This push-and-pull on HPC towards the future manifested in three big themes.</p><h3 id=\"theme-1-the-big-number-is-losing-its-shine\">  Theme 1: The big number is losing its shine</h3><p>  HPC has long organized itself around treating the big machine and the big  number as its top priority, and this is why the two largest HPC conferences of  the year honor the semiannual release of the Top500 list on their main stage.  However, this year felt like the first time that no single number (that somehow  reflects \"performance\") dominated the conversation. Instead, the discourse was  more diffuse and discussed \"performance and x\" or \"the supercomputer and x.\"</p><h4 id=\"top500\">Top500</h4><p>  The place where this was most evident to me was at the  <a href=\"https://sc25.conference-program.com/presentation/?id=bof117&amp;sess=sess409\">Top500 BOF</a>, where the latest list was unveiled.</p><p>  The biggest announcement was that Europe now has its first benchmark-confirmed  exascale system in JUPITER, which ran a full-system HPL at  <a href=\"https://mastodon.social/@andih/115566907716591104\">1,000,184 TFLOPS</a>  for two hours and seven minutes. However, JUPITER didn't get any stage time at  the BOF since, like Aurora, it actually debuted on a previous list with a  sub-exascale run. 
This run pushed it over the exascale finish line, but  if the Top500 list metadata is to be believed, the run used 100% of JUPITER's  5,884 nodes to break the barrier--a feat that is unlikely to be reproduced by  any production application, since it is rare to have zero failed nodes in any  large-scale production environment.</p><p>  So, while there was little fanfare for Europe in breaking the exaflops barrier  with its new big machine and big number, there were some big  announcements--one overt, and others more muted.</p><p>  The biggest news was that <strong>the Top500 list is changing hands</strong>.  Whereas it has historically been controlled by three people--Jack Dongarra,  Horst Simon, and Erich Strohmaier--it will be transitioning to be  community-controlled under the stewardship of ACM SIGHPC. Dongarra, Simon, and  Strohmaier will still be on the steering committee under the ACM stewardship,  but this new governance structure opens the doors for new ideas to breathe new  life into the way systems are ranked and, more broadly, how \"performance\" is  meant to be interpreted from Rmax.</p><p>  At present, the list (and related lists) are bound by rules that, in the  present day of reduced-precision accelerators, make little sense. For example,  using the Ozaki scheme within the LU decomposition is not allowed by Top500  despite the fact that it can produce the same answer with the same numerical  accuracy much faster than hardware FP64. And while the HPL-MxP benchmark does  allow solving the same problem using more creative methods, Strohmaier  highlighted a problem there too: it never dictated how to deal with multiple  levels of mixed precision until AIST broke the rankings. 
AIST ran HPL-MxP at  both 16-bit and 8-bit precisions, resulting in their ABCI 3.0 system  simultaneously ranking at #6 and #10.</p><p>  These sorts of issues make it easy to question the value of leaderboards like  Top500 or HPL-MxP, as their definition of \"performance\" becomes increasingly  further divorced from how large supercomputers are really used. The past few  years have shown that there hasn't been the time or energy to get ahead of  these ambiguities amongst the three men maintaining the list, so transitioning  it to ACM will hopefully be a positive move that will give the list a chance  to be revitalized.</p><p>  To their credit,  <strong>the incipient stagnation of the Top500 list</strong> was called out by  Strohmaier during his analysis of the list, acknowledging that \"growth has  tremendously slowed down compared to what it used to be\" and \"we don't have  proof of what is actually the reason for that:\"</p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure><figcaption class=\"image-caption\">All the key highlights of this SC's Top500 list.</figcaption></figure></div><p>China has stopped submitting, the AI and hyperscale providers really never  started submitting, and retired systems are being thrown off the list long  before they fall off the bottom. To me, this was a tacit acknowledgment that  the list does not have a bright future out to 2030 unless it is modernized to  be relevant to the way in which today's largest systems are actually being  used--which is not DGEMM.</p><p>  The final surprising acknowledgment during Strohmaier's talk was that  <strong>the list is trailing the state of the art in hardware</strong> by  quite a bit. He pointed out that Blackwell systems are only now starting to  appear even though they've been shipping in volume for the better part of a  year. 
While he hypothesized that there is \"uneasiness\" about Blackwell in an  HPC context, the reality is that there are no Blackwells for HPC until the  Blackwell orders for hyperscale AI have been fulfilled. HPC is second in line,  and even then, the only Blackwells I could find on this year's Top500 list  were NVL8 configurations--not the NVL72 configurations that have been filling  up hyperscale datacenters like  <a href=\"https://glennklockwood.com/garden/systems/Fairwater\">Fairwater</a>.</p><p>  Strohmaier pointed out that Blackwell, by virtue of its HBM3e (vs. Hopper's  HBM3), is showing up higher on the HPCG list (which is a memory bandwidth  test) than on Top500 (which is an FP64 FLOPS test). He phrased this as  evidence that \"not everything is bad for the HPC community,\" but I would have  phrased my conclusion a little differently:</p><ol type=\"1\"><li>    Blackwell is actually great for HPC, because most real workloads are    memory-bandwidth bound, not FLOPS bound. The fact that B200 offers similar    FP64 FLOPS at higher memory bandwidth means that real applications will get    higher effective use of those FP64 FLOPS.  </li><li>    Despite the above, Blackwell doesn't perform well on Top500 because HPL    doesn't reflect the reality that memory bandwidth is important. It follows    that HPL doesn't reflect the reality of real HPC applications. A Blackwell    system can be significantly better for real HPC applications than a    comparably sized Hopper system even though it may rank lower than Hopper on    Top500.  </li><li>    Blackwell isn't showing up in volume now because the HPC community is second    in line. The HPC community isn't uneasy as much as it is completely locked    out. The first NVIDIA-based exascale system debuted in November 2025 despite    its GPU being three years old, suggesting that if big Blackwell systems ever    appear on Top500, it'll happen in 2026-2027.  
</li></ol><p>  All of this is a roundabout way of showing that the big number--in this case,  the HPL score--no longer leads meaningful conversation around how useful a  system is for science.</p><h4 id=\"the-gordon-bell-prize\">The Gordon Bell Prize</h4><p>  Another major indicator of the changing tide away from the big number was the  work that won this year's  <a href=\"https://awards.acm.org/bell\">Gordon Bell Prize</a>. The winning  paper, titled \"<a href=\"https://arxiv.org/html/2504.16344v2\">Real-time Bayesian inference at extreme scale: A digital twin for tsunami    early warning applied to the Cascadia subduction zone</a>,\" wasn't the typical case of running a huge simulation for a few hours and  reporting some result. Rather, it described a four-step workflow that  culminates in the desired insight popping out of a computation that runs  across only 128 nodes and completes in less than 0.2 seconds. Furthermore, the  hero run part could be decomposed into trivially parallel components, allowing  the bulk of the computation to be geographically distributed across HPC  centers or GPUs spread across on-prem and cloud providers.</p><p>  My understanding of the work is that there was a massive \"offline\" computation  to precompute a few key matrices (Phase 1) followed by two shorter offline  steps that turn those matrices into the core of the digital twin. The last  step, which was \"online\" and designed to be computed in real-time, could then  take this core and solve the input problem with extremely low latency. This  workflow front-loads a hero run in such a way that, if an earthquake were to  occur, the risk of tsunami could be calculated in less than a second using  only modest compute resources and the precomputed core.</p><p>  The authors eschewed methods that generated tons of FLOPS in favor of methods  that were less FLOPS-efficient but got to the answer faster. In the authors'  own words:</p><blockquote><p>    As shown in Fig. 
<a href=\"https://arxiv.org/html/2504.16344v2#S7.F7\">7</a>, higher FLOP/s does not necessarily lead to faster time-to-solution. On    MI300A nodes of <em>El Capitan</em>, the best-performing    implementation, Fused PA, achieves a lower percentage (5.2%) of theoretical    peak FLOP/s than Fused MF (5.5%) but is faster.  </p></blockquote><p>  Interestingly, the hero computation here was embarrassingly parallel(ish) as  well; in their demonstration run, the hero run (Phase 1) was broken into 621  independent calculations each requiring 128 nodes (512 A100 GPUs) for about an  hour. Because they are independent, these tasks could be parallelized across  multiple HPC centers as well, and my understanding is that the data volumes  involved are modest; Phase 1 would require a single shared copy of the input  mesh (a hundred GiB?) per HPC center, and each of the 621 tasks would output  around 8 GiB which would have to be copied back.</p><p>  While I don't understand the mathematics behind this work, the paper took what  would've been a huge exascale-class mathematical problem (\"10 years on a  sustained 1 EFLOP/s machine\") and reformulated it into a workflow that solves  the problem faster and more usefully. Instead of brute-forcing the problem  with a big supercomputer, they split it into separate offline and online  parts, and this naturally allowed the most computationally expensive part to  be geographically distributable.</p><p>  This work surrendered the need for a single big machine, and it didn't produce  a big-number result. 
But it did win the Gordon Bell Prize, again signaling  that the HPC community is beginning to look beyond performance-only and think  about awarding innovation according to outcomes, not just FLOPS.</p><p>  The talk for this paper can be viewed  <a href=\"https://sc25.conference-program.com/presentation/?id=gb106&amp;sess=sess577\">here in the SC25 Digital Experience</a>.</p><h4 id=\"fixing-problems-caused-by-the-big-number\">  Fixing problems caused by the big number</h4><p>  Most of my perception around the HPC community beginning to de-emphasize the  singular big machine or big number arose from organic interactions I had with  colleagues and customers though. It's hard to summarize how these  conversations went, but the  <a href=\"https://sc25.conference-program.com/presentation/?id=bof197&amp;sess=sess439\">Lustre Community BoF</a>  is a good example of what I saw elsewhere.</p><p>  Lustre has long been the gold standard in high-performance parallel I/O in the  HPC community because it was designed from day one to deliver high bandwidth  above all else. As a result, Lustre already has the big number solved in many  ways, and events like the Lustre BOF are a great case study in what it looks  like for a performance-first technology to be pushed into adapting to deliver  more than just a big number.</p><p>  First, the ever-innovative Stéphane Thiell from Stanford discussed the process  and tooling he developed to enable online capacity expansion of a Lustre file  system. The basis for it was a distributed, fault-tolerant tool he developed  that uses redis, lfs find, and lfs migrate to manage the state of file  migrations across Lustre servers as the file system is rebalanced. 
While a  part of me thought this was a great tool that would be super helpful for many  others, another part of me was kind of horrified.</p><p>  Maybe I've been spoiled by working in hyperscale and AI these past three  years, but online capacity expansion and rebalancing is a built-in capability  of all distributed storage systems these days. All the major cloud object  stores do this, as do all modern parallel file systems including Quobyte,  VAST, and WEKA. Of course, none of these modern systems are as efficient (on a  per-CPU core or per-SSD basis) as Lustre at delivering peak performance. But  Stéphane's talk made me realize the price that's paid for this great  performance.</p><p>  Andreas Dilger and others went on to talk about Lustre futures, and as they  were speaking, I noticed that nobody was talking about performance  improvements to Lustre. Rather, feature development was focused on catching up  in every other dimension--data governance, reliability, manageability, and  others. For example, Andreas talked a bit about the upcoming \"multi-tenancy\"  features coming to Lustre:</p><p></p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure><figcaption class=\"image-caption\">It's a lot of work to retrofit multitenancy into a performance-first file system.</figcaption></figure></div><p>I put “multi-tenancy” in quotes because these changes really represent trying to back into a security posture that is fundamentally different from the one that Lustre was designed around. In the pursuit of performance, Lustre (as with most other HPC technologies) was designed assuming that security was someone else’s problem. 
By the time someone could log into a system that could mount a Lustre file system, they had already been authenticated, and it was up to the OS on each compute node to authorize any interactions with Lustre itself. This is the “implicit trust” model.</p><p></p><p>  The problem, of course, is that the rest of the world has adopted a \"zero  trust\" model which makes many things (except performance!) generally easier.  Compliance is easier when the system assumes that everything is encrypted as a  default and key management can be delegated to a third party. Because Lustre  didn't do this from the outset, it is going through this process of  retrofitting encryption in various places and using a mixture of nodemaps,  UID/GID maps, and shared secrets to patch over all the places where trust was  fundamentally implicit.</p><p>  Later on in the BOF, panelists acknowledged (some half-heartedly) that  manageability of Lustre was a barrier. One panelist admitted that it took five  years of work to almost get to the point where a Lustre update can be done  without crashing applications. Another panelist said that multitenancy in  Lustre is easy <em>if you follow a million steps</em>, and that his company  was developing script-based ways to simplify this. While the idea of using  scripts to simplify operations is not bad, from a secure supply chain  standpoint, relying on third-party bash scripts to enable features required  for legal compliance is horrifying.</p><p>  I don't mean to pick on Lustre alone here; other HPC technologies such as  InfiniBand, Slurm, and DAOS are facing the same reality: architectures that  prioritized performance and scalability over everything else are now going  through similar contortions to retrofit modern requirements like security,  manageability, and data governance. 
For those HPC centers who do not have to worry about compliance  (which is most of open-science computing), these technologies will continue to  be just fine.</p><p>  However, the  <a href=\"https://blocksandfiles.com/2025/11/18/vast-data-dell-versity-and-spectra-logic-are-shining-storage-stars-on-taccs-horizon/\">successes of these modern file systems</a> <a href=\"https://blocksandfiles.com/2025/07/04/vast-doudna-supercomputer-storage/\">across leading HPC centers</a>  and the proliferation of alternative technologies such as  <a href=\"https://nrp.ai\">Kubernetes-based HPC</a> and  <a href=\"https://blogs.microsoft.com/blog/2025/11/12/infinite-scale-the-architecture-behind-the-azure-ai-superfactory/?utm_source=chatgpt.com\">MRC over Ethernet</a>  tell me that HPC is coming around to the idea that marginal increases in  performance are no longer worth missing out on factors that weigh heavily on  day-to-day operations like manageability, reliability, and flexibility.</p><h3 id=\"theme-2-hpc-policy-is-becoming-ai-policy\">  Theme 2: HPC policy is becoming AI policy</h3><p>  Some of the biggest news at SC was not actually showcased at the conference  despite being what many people wanted to talk about in side conversations: HPC  policy is rapidly becoming AI policy, resulting in a slew of huge (but poorly  defined) \"public-private partnerships.\"</p><p>  As a bit of background, the Oak Ridge Leadership Computing Facility announced  its next system, Discovery, in late October--this was the result of a  \"typical\" supercomputer procurement process that  <a href=\"https://www.nextplatform.com/2023/10/02/the-first-peeks-at-the-doe-post-exascale-supercomputers/\">first came into the public eye in 2023</a>. 
However, the Discovery announcement also included mention of a smaller  system, Lux, which will \"<a href=\"https://www.olcf.ornl.gov/2025/10/27/ornl-amd-and-hpe-to-deliver-does-newest-ai-supercomputers-discovery-and-lux/\">leverage the Oracle Cloud Infrastructure (OCI)</a>\" (whatever that means) to provide earlier access to AMD MI355X GPUs ahead of  Discovery's full-scale deployment.</p><p>  Then, two days later, Argonne National Laboratory announced a  <a href=\"https://www.energy.gov/articles/energy-department-announces-new-partnership-nvidia-and-oracle-build-largest-doe-ai\">similar arrangement with Oracle Cloud and NVIDIA</a>  to deliver a small (Lux-sized) GPU supercomputer named Equinox, followed by a  much-larger 100,000-GPU supercomputer named Solstice. Neither Equinox nor  Solstice are attached to a \"typical\" supercomputer procurement; the follow-on  to Aurora, to be named  <a href=\"https://intro-hpc-bootcamp.alcf.anl.gov/sites/hpc/files/2025-09/WelcomeToHPC_Papka.pdf\">Helios</a>, is  <a href=\"https://www.alcf.anl.gov/draft-technical-requirements-alcf-4-system\">still in planning</a>  and will be deployed in 2028. 
This strongly suggests that, whatever  \"public-private partnership\" means to the DOE, it is not the same as the  typical leadership computing systems; it is its own AI-centric program.</p><p>  At SC itself, Evangelos Floros (EuroHPC's head of infrastructure) also  mentioned the \"need for public-private partnerships\" to realize EuroHPC's goal  of building \"AI Gigafactories\" with \"100,000 advanced AI processors\" across  Europe.</p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure><figcaption class=\"image-caption\">\"Need for public-private partnerships\" to fund AI factories is recognized by EuroHPC too.</figcaption></figure></div><p>Again, what exactly this \"public-private partnership\" model entails in Europe  was never really defined.</p><p>  What was clear is that both American and European efforts are declaring the  need to build massive (100K+ GPU) supercomputers for AI, the traditional HPC  centers will be the public stewards of them, and \"public-private partnerships\"  are the only way to realize them since governments alone cannot foot the bill.</p><p>  The Top500 BOF also included a short, awkward talk by Rick Stevens titled \"The  DOE AI Initiatives\" that amounted to Stevens saying he had nothing to say.  What really happened, I suspect, is that DOE's new \"<a href=\"https://genesis.energy.gov\">Genesis Mission</a>,\" which was announced the week <em>after</em> the SC conference, was a week  late and therefore couldn't be discussed as originally planned. If Stevens had  been able to describe the Genesis Mission, though, I'm sure he would've also  described \"public-private partnership\" as a key aspect, since the same  language is used in the  <a href=\"https://www.whitehouse.gov/presidential-actions/2025/11/launching-the-genesis-mission/\">Executive Order that established Genesis</a>. 
And I'm sure his description would've been no clearer about what this  really means than what EuroHPC or the OCI/DOE descriptions have stated.</p><p>  Most revealing was my observation that, even outside of the proper conference  program, nobody really knew what any of this meant. I talked to plenty of my  colleagues from both government HPC and hyperscale cloud organizations, and  the only consistent message was that there aren't many concrete facts backing  up the press releases right now. It appears that these partnerships were  brokered far outside the usual channels through which large supercomputer  procurements are normally done, and the people in charge of actually delivering on the  promises of the press releases are still figuring out what is possible.</p><p>  Connecting the dots between Lux/Equinox/Solstice, Genesis, and a recent  <a href=\"https://www.energy.gov/sites/default/files/2025-04/RFI%20to%20Inform%20Public%20Bids%20to%20Construct%20AI%20Infrastructure%20%28website%20copy%29.pdf\">RFI</a>  and  <a href=\"https://sam.gov/workspace/contract/opp/7864e8f4d61f42dc811ba095a41c8368/view\">RFP</a>  from DOE to allow  <a href=\"https://www.energy.gov/articles/doe-announces-site-selection-ai-data-center-and-energy-infrastructure-development-federal\">hyperscalers to build AI factories on federal land</a>, it appears that what is happening is...</p><ul><li>    The DOE has a bunch of land that is adjacent to the National Labs that is    undeveloped but has the infrastructure to support massive AI factories.    Specifically named is a 110-acre parcel at Argonne that can accommodate an    \"AI data park\" of up to 1 GW, and a 100-acre parcel at Oak Ridge with up to 800    MW. These details were disclosed in    <a href=\"https://www.energy.gov/sites/default/files/2025-04/RFI%20to%20Inform%20Public%20Bids%20to%20Construct%20AI%20Infrastructure%20%28website%20copy%29.pdf\">an RFI they issued earlier in the spring</a>.  
</li><li>    The    <a href=\"https://www.energy.gov/articles/energy-department-announces-new-partnership-nvidia-and-oracle-build-largest-doe-ai\">Solstice press release</a>    specifically said that DOE envisions \"shared investments and shared    computing power between government and industry.\" Given the RFI/RFP were    about land leases, these public-private partnerships may involve splitting    the costs of space/power/cooling (the land and infrastructure being leased)    and the capital/operations (the supercomputer cloud services being built)    between the Labs and Oracle.  </li></ul><p>  A potential model for operations is that cloud providers are allowed to build  and operate commercial AI cloud services adjacent to the DOE HPC facilities in  exchange for the DOE Genesis Mission being entitled to some of those AI cloud  capabilities. Exactly how much supercomputing resources hyperscalers like OCI  would give to DOE, and exactly how much it would cost the DOE Labs to serve as  landlords, is probably still undefined. But seeing as how power is the single  biggest limiter in AI these days, I expect this model will only spread costs  around, not actually lower them.</p><p>  If this is indeed how Genesis plays out, this would establish a bizarre new  way for the government to acquire HPC (or AI) capabilities that completely  sidesteps the standard procurement model. Instead of plunking down a hundred  million dollars a year to finance a new leadership supercomputer, we might be  moving into a world where the Labs plunk down a hundred million dollars a year  to cover the costs of power, space, and cooling for a cloud provider. 
And  instead of owning a leadership supercomputer, these national HPC facilities  wind up consuming HPC (well, AI) resources from cloud providers--hopefully at  a cost that reflects the fact that the cloud providers are also profiting from  cycles being sold off of these machines to commercial AI customers.</p><p>  But again, this is all speculation based on the consistencies I heard  throughout the conference and the experience I had trying to build these sorts  of partnerships with the HPC community while I worked at Microsoft. I may be  right, or I may be wildly wrong. There are probably only a handful of people  in the world with a clear idea of what these partnerships are meant to look  like right now, and they are all way above the heads of the people at the HPC  centers who will be tasked with executing on the vision.</p><p>  Selfishly, I am also left with a bit of heartburn over all of this news. I put  a lot of personal time and energy into giving the HPC community the  information it needed to feel comfortable about partnering with hyperscale AI  infrastructure providers while I was at Microsoft, and it often felt like a  Sisyphean task. Within months of me giving up and moving on from my career at  a cloud provider, to see a complete reversal of policy from the leadership HPC  folks--and to see the \"other guy\" in pole position--is a bit of a slap in the  face.</p><p>  I also couldn't help but notice that the cloud provider in all the headlines  in the US didn't seem to demonstrate a very strong and unified presence at SC  this year. Comically, they didn't even use their own brand's colors for their  booth on the exhibit floor. And the color scheme they did use left no room for  Oak Ridge's Lux system, which will be AMD-based, to be showcased.</p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure><figcaption class=\"image-caption\">Oracle's booth at SC25. Their brand color is red, not green. 
Or so I thought.</figcaption></figure></div><p>Though I may have read too much into this, it feels like these public-private  partnerships are not necessarily composed of equal partners with equal levels  of commitment.</p><p>  More broadly, I left the conference concerned that the discourse happening  around these cloud-HPC/AI integrations--at least in the US--appears to have  regressed compared to where it was when I worked at Microsoft. Many of the  things we had to figure out years ago (cybersecurity models, impacts on jobs  at the HPC centers) seem to have reset to zero. And sidestepping the  procurement processes for leadership computing to enable these public-private  partnerships will either require significant new funding (of which Genesis  provides none; the executive order as-written appears to recolor existing  money) or robbing Peter (the budget funding the next generation of leadership  HPCs) to pay Paul (the cloud providers serving up compute resources for AI).  As a result, I can envision a future where all of the money that used to fund  leadership computing for science becomes money to fund commercial AI  factories, resulting in a slow evaporation of the LCFs as their HPC  capabilities shrink in size and relevance.</p><p>  Though there's lots more to be said on this topic, it's all based on  conjecture. So, maybe the best thing to do is quietly wait and see.</p><h3 id=\"theme-3-ai-discourse-is-growing-up\">  Theme 3: AI discourse is growing up</h3><p>  This was the first SC where it felt like the discourse around AI's role in the  future of scientific computing actually carried some substance. 
Whereas  previous years saw talk that mostly revolved around basic ideas like \"do LLMs  hallucinate too much?\" or \"can ChatGPT write MPI code?,\" I sat in on a number  of interesting talks and conversations that skipped the question of \"is AI  useful?\" and went straight to \"this is how AI is proving useful to us.\"</p><p>  Maybe it's related to the previous theme: HPC money is becoming AI money, so  AI research is becoming required to stay afloat. Or maybe it's because 2025  has been the year of agentic AI, and agents allow LLMs to be integrated much  more surgically into complex workflows. Or maybe confirmation bias led me to  sit in sessions and talk with people who are at the frontier of applying AI to  scientific discovery. Whatever the case may be, I was glad to hear so much  discussion from researchers around the importance of all the connective tissue  required to operationalize AI in scientific computing.</p><h4 id=\"agentic-workflows\">Agentic workflows</h4><p>  A great example of this was the  <a href=\"https://aiexscale.github.io\">1st International Symposium on Artificial Intelligence and Extreme-Scale    Workflows</a>, which happened on Friday. One of the invited speakers, Dr. Katrin Heitmann,  connected a lot of dots in my head with a talk she gave on how massive-scale,  physics-based simulation workflows can benefit from agentic AI.</p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure><figcaption class=\"image-caption\">Heitmann's vision on how agentic approaches can augment (but not replace) humans in complex scientific workflows.</figcaption></figure></div><p>The crux of the challenge faced by most massive-scale simulations (like  <a href=\"https://cpac.hep.anl.gov/projects/hacc/\">HACC</a>, the cosmology code  for which she is famous) is that they generate massive amounts of data. 
The  <a href=\"https://www.anl.gov/cels/article/simulating-the-cosmos-frontiere-sets-new-record-with-trillionparticle-universe-model\">most recent HACC run</a>  generated hundreds of terabytes of compressed data per checkpoint and over a  hundred petabytes of data in the end; this cosmological simulation serves as a  reference dataset from which downstream cosmological research can draw when  exploring targeted questions. The challenge, of course, is finding relevant  pieces of the simulated universe from amidst a hundred petabytes of raw data.</p><p>  Dr. Heitmann's premise is that agents and tools have very specific scopes and  capabilities, and researchers have control over which of these tools they wish  to use. However, they can hand off these tools to an agentic workflow to let  it autonomously sift through all of the data, looking for specific features  within the simulated universe that are relevant. A specific example she gave  was the process of examining 500 million galaxy clusters; with an agentic,  AI-driven approach, a postdoc was able to interactively sift through these  objects without examining each one individually. For truly interesting  objects, a separate agent could go search the literature and provide an  explanation as to why it may be interesting, absolving the postdoc from having  to make round trips between the dataset and external literature.</p><p>  That all said, it was clear from this talk (and others) that integrating  agentic AI into scientific inquiry is still in its early days. But what I  appreciated about this talk (and the entire workshop) is that it sidestepped  pedestrian questions about trustworthiness by acknowledging that the goal  isn't full autonomy, but rather, enabling researchers to do things faster.  
There is still a human at the start and the end of the workflow just as there  always has been, but agents can reduce the number of times a human must be in  the loop.</p><h4 id=\"data-and-agentcentric-service-infrastructure\">  Data and agent-centric service infrastructure</h4><p>  Even when AI wasn't the main topic of discussion, it was clear to me at this  SC that AI is influencing the way researchers are thinking about the  infrastructure surrounding supercomputers. A great example of this was the  keynote at the <a href=\"https://www.pdsw.org/index.shtml\">PDSW workshop</a>,  given by the ever-insightful  <a href=\"https://sc25.conference-program.com/presentation/?id=misc185&amp;sess=sess202\">Dr. Rob Ross, where he offered a retrospective on the work his team has    done over the last two decades</a>, what he felt they got right, what they missed, and what's ahead.</p><p>  Towards the end of his presentation, he made the case that \"science is  increasingly multi-modal.\" But rather than talk about multimodality in the AI  sense, he was emphasizing that there's more to scientific computing than  performance:</p><p></p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure><figcaption class=\"image-caption\">Domain science, provenance, search, and resilience are equal partners to performance in scientific computing.</figcaption></figure></div><p>Taken at face value, this slide positions performance on equal footing with domain science, provenance, and findability; his argument was that we’ve moved beyond the world where the only storage problem that HPC faces is checkpointing. Just as Dr. Heitmann would say on Friday, Dr. Ross’ argument was that the increasing volume of scientific data coming out of both exascale simulation and scientific instruments is driving the field towards more automation. 
And with automation comes a greater need to understand data provenance--after all, if automation produces a surprising result, a human ultimately has to go back and understand exactly how the automation generated that result.</p><p></p><p>  He also pointed out that in this coming world of automation-by-necessity,  infrastructure itself might have to be rethought. After all, traditional  technologies like parallel file systems were designed to make the lives of  human researchers easier; when the primary consumer of data becomes AI agents,  not humans, there may be better ways to organize and expose data than through  files and directories. A human might repeatedly cd and ls to find a specific  dataset on a file system, whereas an agent might query a flat index to find  the same data in a single step.</p><p>  At the end of the same PDSW workshop, I was fortunate enough to contribute to  <a href=\"https://sc25.conference-program.com/presentation/?id=miscp112&amp;sess=sess202\">a panel</a>  where many of these same themes--how will data systems change as AI plays a  greater role in scientific discovery--were discussed. Although we touched on a  lot of topics, what stuck with me was a general acknowledgment that, while HPC  has always talked about data management and provenance as being important,  they were always treated as a \"nice to have\" rather than a \"must have.\"  However, as was echoed across many presentations (including the two I  described above), governance and provenance are now becoming non-negotiable as  larger datasets drive us towards AI-driven automation.</p><p>  Regardless of what you think about AI's ability to accelerate scientific  discovery, I left SC with the feeling that AI is forcing the HPC community to  grow up with regards to how seriously it takes data management:</p><ul><li>    The size and velocity of datasets generated by simulation or experiment is    growing beyond any single person's ability to analyze it by hand. 
The    complexity of these data is also making it harder to develop    heuristics-based or analytical approaches to combing through all of it.  </li><li>    The best path forward to understanding these data is through AI (via    purpose-built models for analysis) or AI-driven data exploration (via    autonomous, agentic workflows).  </li><li>    Automation or autonomous workflows will always act under authority delegated    to them by human researchers, meaning there is a growing need to be able to    inspect how these workflows arrived at the conclusions they generate.  </li><li>    Understanding how an answer was achieved requires significantly better data    management features such as governance, provenance, and auditability. A    result is ultimately only useful if a human can trust it, and that trust    comes from understanding which data informed that conclusion, how that data    was created, and how it was modified over time.  </li></ul><p>  Put differently, checkpointing was the main concern of I/O research because  I/O performance was the first scalability issue that scientific computing ran  into as supercomputers and scientific instruments got bigger. However, we're  now at a point where issues ancillary to performance have reached the limits  of scalability. Dr. Ross's multi-modal slide indicates that provenance,  indices/search, and resilience are some examples of these new barriers, but  there are plenty more as well.</p><p>  In a sense, this theme is the opposite side of the same coin as the first  theme I discussed--that the big number is losing its shine. The hardest  questions going forward aren't the obvious ones about scaling performance;  they are about scaling everything else to keep up. 
AI seems to be the  technology that has cleared a path to these data management hurdles, but the  benefits of adopting strong data management practices and systems will extend  far beyond the reach of just enabling AI-based automation.</p><h2 id=\"the-exhibit-hall\">The exhibit hall</h2><p>  The exhibit hall has long been one of my favorite parts of attending SC  because it's a great way to get a feeling for what technologies and vendors  are hot, where the innovation is trending, and what sorts of commercial  problems are worth solving. Every year I feel like I have less and less time  to walk the exhibit hall though, and the layout and composition of this year's  exhibition meant I only saw a small fraction of what I wanted to see in the  few days it was up.</p><p>  The most common comment I heard about the exhibit this year is captured in  Doug Eadline's article,  <a href=\"https://www.hpcwire.com/2025/11/26/sc25-observations-more-pumps-than-processors/\">SC25 Observations: More Pumps than Processors</a>  (which is well worth the read!). The same commentary was repeated throughout  the OCP conference in October as well, suggesting that there is a lot of money  to be made (or at least the prospect of money) in helping datacenters get  outfitted for the liquid cooling demanded by the next generation of  large-scale GPU infrastructure. However, I found the overwhelming amount of  space devoted to liquid cooling companies acutely problematic at SC25 this  year for two reasons:</p><ol type=\"1\"><li><strong>Most SC attendees have nothing to do with liquid cooling</strong>. A    colleague of mine who operates supercomputers for the energy sector asked    one of these big liquid cooling vendors what he could do to actually engage    with them. After all, he doesn't buy liquid cooling infrastructure; he buys    whole supercomputers that come with heat exchangers and CDUs that are    integrated into the solution. 
The vendor had no good answer, because the    reality is that the typical supercomputer user or buyer has no say over what    piping, coolant, or exchangers are used inside the machine itself. The whole    point of buying an integrated supercomputer is to not have to deal with that    level of detail.  </li><li><strong>These liquid cooling vendors soaked up a ton of floor space</strong>. A few of these physical infrastructure providers had massive (50x50)    booths sprinkled across the exhibit hall. Because the    average SC attendee has nothing to do with liquid cooling, this meant that the    booths that were more likely to be relevant to a typical attendee were much    further apart than they had to be.  </li></ol><p>  The end result was that the exhibit hall was absolutely gargantuan and yet  information-sparse. In fact, this year saw a secondary exhibit hall in the old  football stadium serve as overflow space, because the entire primary exhibit  hall was full. What's worse is that this overflow space was (as best as I  could tell) completely disconnected from the main hall, and the only time I  ever saw it was from the dining area used to serve lunch for the tutorials.</p><p></p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure><figcaption class=\"image-caption\">The exhibit hall's overflow space being set up in the former football stadium.</figcaption></figure></div><p>I would’ve been furious if I had been stuck with a booth in this overflow space, because I can’t imagine the foot traffic in there was very high. I personally couldn’t even find the entrance to this second exhibition area in the few hours I had to look for it.</p><p></p><p>  I can't help but think the SC organizers leaned far too much into booking up  as much space (and therefore exhibitor dollars) as possible without thinking  about the dilutive effects of having such a massive vendor count. 
Some vendors  definitely benefitted from having a good location near one of the hall  entrances, but I also heard a nontrivial amount of grumbling around how little  traffic there was at some of the big booths. It wouldn't surprise me if there  was a contraction of the HPC mainstays at SC26.</p><h3 id=\"by-the-numbers\">By the numbers</h3><p>  Rather than rely solely on anecdotes though, it's also fun to take a  quantitative look at the changes in exhibitors relative to last year. Since I  spent the time figuring out how to generate tree maps for my SC24 recap last  year, I figured I should re-run the same analysis to compare SC25 to SC24.</p><p>  Of the biggest booths who were exhibiting for the first time this year, it  should be no surprise that the two biggest new entrants were Danfoss (liquid  cooling infrastructure) and Mitsubishi Heavy Industries (gas turbines and  other large-scale infrastructure):</p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure> <figcaption class=\"image-caption\">New exhibitors with the largest booths.</figcaption></figure></div><p>Of the other top new exhibitors, some (Solidigm, Sandisk, C-DAC, MinIO, and  University of Missouri Quantum Innovation Center) were quite relevant to the  typical SC attendee. Arm was also back after having skipped SC24. 
But there  were scores of new exhibitors whose services and products seem much more  relevant to very niche aspects of physical datacenter infrastructure.</p><p>  Of the exhibitors who didn't show up to SC25 but had big booths at SC24, there  was a diverse mix of markets:</p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure><figcaption class=\"image-caption\">Vendors who didn't show up to SC'25 but had big booths at SC'24.</figcaption></figure></div><p>Sadly, higher ed and government popped up on this list (see  <a href=\"https://www.hpcwire.com/2025/11/26/sc25-observations-more-pumps-than-processors/\">Doug Eadline's take on this for more</a>). A bunch of datacenter infrastructure providers also vanished, including  Valvoline and Boundary Electric; this suggests that some of the top new  vendors of this year (Danfoss, Mitsubishi) may similarly vanish entirely next  year after realizing that SC isn't really their crowd. But I was also  surprised to see some big names in AI vanish; Iris Energy (IREN) is a GPU  cloud provider that just inked a multi-billion dollar deal with Microsoft;  Ingrasys manufactures much of the world's GB200 NVL72 infrastructure; Groq  and SambaNova also inexplicably vanished.</p><p>  Perhaps more interesting are the top growers; these vendors exhibited both  last year and this year, but went significantly larger on their booth sizes:</p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure><figcaption class=\"image-caption\">Biggest increases in booth size at SC'25 vs. SC'24.</figcaption></figure></div><p>Legrand, which provides datacenter infrastructure bits, likely grew as a  result of it acquiring USystems and merging USystems' booth with Legrand's  booth this year. 
The other big booth expansions are mostly household names  though; Gates, EBARA, and GRC are cooling vendors that the typical SC attendee  can't do much with, but the others are organizations with whom a researcher or  HPC datacenter operator might actually talk.</p><p>  Finally, the top contractions in booth space are a mix of service providers,  HPC facilities or research centers, and component suppliers:</p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure><figcaption class=\"image-caption\">Biggest decreases in booth size at SC'25 vs. SC'24.</figcaption></figure></div><p>Of the biggest vendors who downsized, Carahsoft is a component reseller and  service provider, Stulz is a liquid cooling company, HLRS is a German  supercomputer center, and Viridien is an HPC services company that primarily  serves the energy sector. It is surprising to see AWS shrink while Microsoft  grew, and it is doubly surprising to see Oracle shrink when it's at the center  of the biggest HPC deployment news of the season. Given that these booth sizes  are chosen a year in advance, this may speak to how unexpected the turn of  events was that resulted in Oracle carrying the cloud services end of DOE's  big public-private partnerships.</p><h3 id=\"interesting-new-technology\">Interesting new technology</h3><p>  For reasons I'll discuss later, I didn't have much time to walk the exhibit  hall floor. Combined with the fact that everything was so spread out and  diffuse, I just didn't get a great sense of what interesting new technology  was being introduced this year beyond what tended to stick out. 
And amidst all  the giant CDUs and liquid cooling infrastructure, it was hard for anything to  stick out except really big compute cabinets.</p><h4 id=\"dell-ir700\">Dell IR7000</h4><p>  Dell's booth had a fully loaded IR7000 rack on display (as they did at  <a href=\"https://blog.glennklockwood.com/2025/03/gtc-2025-recap.html#dells-480-kw-ir7000\">GTC earlier in the year</a>) with 36 GB200 NVL4 sleds. At 50OU high (almost eight feet tall), this thing  is physically huge:</p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure><figcaption class=\"image-caption\">Dell's 50OU IR7000 rack, fully loaded. This is what TACC Horizon will be built from.</figcaption></figure></div><p>Unlike the version they had on display at GTC though, this one had both the  front door and a full rear-door heat exchanger installed:</p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure><figcaption class=\"image-caption\">HUGE rear-door heat exchanger on the back of the Dell IR7000 rack.</figcaption></figure></div><p>What's notable about this platform is that we now know that it is the basis  for both  <a href=\"https://tacc.utexas.edu/systems/horizon/\">TACC's upcoming Horizon system</a>  (which will have  <a href=\"https://glennklockwood.com/garden/systems/Horizon\">28 of these fully loaded racks</a>) and  <a href=\"https://www.nersc.gov/what-we-do/computing-for-science/doudna-system\">NERSC's upcoming Doudna system</a>  (which will have Vera Rubin rather than Blackwell). This rack was nominally  designed for hyperscale AI and is the basis for Dell's GB200 NVL72 (XE9712)  deployments at places like CoreWeave and xAI, which means that it'll be  thoroughly tested at scale long before TACC or NERSC have it up and running.  
This is the opposite of what has historically happened: before AI, it was  usually government HPC that had to debug new rack-scale architectures before  industry would touch it.</p><h4 id=\"hpe-cray-gx5000\">HPE Cray GX5000</h4><p>  However, government HPC will still have a chance to debug a new supercomputing  platform in the recently announced  <a href=\"https://glennklockwood.com/garden/Cray-GX\">Cray GX</a> (formally  called \"the HPE Cray Supercomputing GX platform\"), which is the successor to  the current Cray EX platform. This is the platform that the  <a href=\"https://glennklockwood.com/garden/systems/Discovery\">Discovery supercomputer</a>  at OLCF will use, and HPE had a CPU-only blade (<a href=\"https://glennklockwood.com/garden/nodes/Cray-GX250\">Cray GX250</a>) and a rack mockup on display at SC:</p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure><figcaption class=\"image-caption\">HPE's new GX blade form factor. This one appears to be the GX250, the 8-socket CPU-only blade.</figcaption></figure></div><p>It's hard to tell the size of this blade from the photo, but if you look at  the relative size of the CPU socket and the DIMM slots, you can get a sense of  how physically massive it is--it's like a coffee table. It also isn't  perfectly rectangular; Cray decided to put this unusual protrusion on the  front of the blades which is where the four NICs and eight E1.S SSDs are  housed:</p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure><figcaption class=\"image-caption\">A look at the side of the Cray GX blade's \"nose\" showing the side-mounted NIC ports.</figcaption></figure></div><p>This nose(?) adds more surface area to the front of the rack, and it makes  more sense when you see a rack full of these nodes. 
HPE had a full GX5000 rack  with mocked-up cardboard nodes in their booth as well:</p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure><figcaption class=\"image-caption\">Fully loaded GX5000 rack. The nodes were cardboard, but pretty nice cardboard.</figcaption></figure></div><p>By having the NIC ports (which are Slingshot 400) face the sides of the rack  rather than stick out the front, the bend radius of all that copper doesn't  have to be quite as dramatic to route it along the sides of these node noses.  And unlike previous Cray designs, there's also no midplane or backplane that  connects the nodes in a rack to the rack-local switches; everything connects  through discrete copper or optical cables.</p><p>  At the center of the rack is a liquid-cooled switch chassis, and each rack can  support either 8-, 16-, or 32-switch configurations. Each switch is a 64-port  Slingshot 400 switch, and I think the premise is that a single GX5000 rack is  always exactly one dragonfly group. If you want a smaller group, you use a  switch chassis with fewer switches.</p><p>  Interestingly, this GX will also support non-Slingshot Ethernet and XDR  InfiniBand switches. Given that both XDR InfiniBand and 800G Ethernet are  shipping today and have twice the bandwidth that Slingshot 400 will have when  it starts shipping in a year, perhaps the Slingshot 400 option is just a  stopgap until HPE's investments in Ultra Ethernet result in a product. The  lack of a network backplane in the rack also makes it easier for the rack to  accommodate the non-dragonfly topologies that would be required for InfiniBand  or Ethernet.</p><p>  The rear of the rack is remarkably unremarkable in that it simply contains a  rear bus bar and the liquid cooling manifolds and mates. 
In this sense, the  rack looks very  <a href=\"https://www.opencompute.org/documents/open-rack-base-specification-version-3-pdf\">OCP-like</a>; the boring stuff is in the back, everything exciting is serviced from the  front, and the rack itself is passive plumbing. Like any OCP ORv3 rack, power  shelves slot in just as server blades do, and they use the same liquid cooling  infrastructure as the rest of the rack. They power the bus bar, and the blades  and switches draw from the same bus bar.</p><p>  Compared to an ORv3 rack though, these GX racks are wider and shorter. The  width probably offers more flexibility for future NVIDIA or AMD GPU boards,  but I was surprised that Cray didn't go ultra tall like Dell's 50OU IR7000. I  was also surprised to hear that Cray is launching GX with a 400 kW cabinet  design; power appears to already be a limiting factor in the nodes launching  with GX. A single 400 kW GX rack can support</p><ul><li>40 CPU-only blades (81,920 cores of Venice)</li><li>28 AMD Venice+MI430X blades (112 GPUs)</li><li>24 NVIDIA Vera+Rubin blades (192 GPUs)</li></ul><p>  For reference, the demo GX5000 rack pictured above had only 29 blades and 16  switches. I assume that fitting 40 blades into the rack requires using the  smallest dragonfly group possible.</p><p>  On the cooling front, the GX5000 rack will launch with support for the same  1.6 MW CDUs as the current Cray EX platform. I heard talk of a neat sidecar  CDU option as well, but the person with whom I spoke at the HPE booth said  that would come a little later.</p><p>  Overall, I was surprised by how un-exotic the new Cray GX platform is compared  to what the AI world has been doing with ORv3 racks. The fact that Cray and  Dell's designs are more similar than different suggests that the HPC/AI world  is converging on a place where the future is uncertain, and flexibility is  more important than highly engineered racks that optimize for very specific  nodes and networks. 
It also suggests that the real value of buying Cray is  higher up the stack; liquid cooling, power delivery, and rack integration are  becoming commoditized thanks to AI.</p><p>  I was also surprised that Cray's next-generation design is not obviously  superior to what the hyperscale community is designing. Whereas the GX rack  caps out at 400 kW, Dell's will allegedly scale up to 480 kW. That said,  today's IR7000 racks shipping for Horizon are only 215 kW (for GPU racks) and  100 kW (for CPU-only racks) according to a talk given by Dan Stanzione:</p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure><figcaption class=\"image-caption\">The physical configuration of TACC's upcoming Horizon supercomputer.</figcaption></figure></div><p>So until the final specifications for the Rubin GPU are released, I suspect we  won't know whether Cray still leads the pack in terms of compute density, or  if Dell made the better bet by aligning its supercomputing platform on a  standard OCP rack design.</p>",
            "url": "https://hpc.social/personal-blog/2025/sc-25-recap/",
            
            
            
            
            
            "date_published": "2025-12-01T14:34:00-07:00",
            "date_modified": "2025-12-01T14:34:00-07:00",
            
                "author": "Glenn K. Lockwood's Blog"
            
        },
    
        {
            "id": "https://hpc.social/personal-blog/2025/the-trap-of-prioritizing-impact/",
            "title": "The trap of prioritizing impact",
            "summary": null,
            "content_text": "(I wrote this originally as a comment in RLS in response to a staff-level engineer who was frustrated at how little they got to code anymore, and it resonated with enough folks that maybe it’s worth sharing here!)There’s a trap I’ve seen a lot of staff+ folks fall into where they over-prioritize the idea that they should always be doing “the right, most effective thing for the company”. When I see engineers complain that they don’t get to code enough, I often suspect they’ve fallen prey to this.I say that’s a trap! because I see people do this at the expense of their own job satisfaction and growth, which is bad for both them and (eventually) for the company which is likely to lose them.I don’t blame people for falling into this trap, it’s what we’re rewarded for. I’ve fallen into it! I have stopped doing technical work I cared about, prioritized #impact, and fought fires wherever they arose. I have spent all my time mentoring and teaching and none coding. The result was often grateful colleagues, but also burnout and leaving jobs I otherwise liked.Whereas when I’ve allowed myself to be like 30% selfish — picking some of my work because it was fun and technical, even when doing so was not the “most impactful” thing I could do — I was happier, learned more, and stayed in roles longer.An example: I worked on a team that was doing capacity planning poorly and was buying too much hardware. (On-prem, physical hardware.) I could have solved the problem with a spreadsheet, but that was boring and made my soul hurt.What I did instead was dig into how our container scheduling platform worked, and wrote a nifty little CLI tool that would look at the team’s configured workloads and spit out a capacity requirement calculation. It took about three times as long as the spreadsheet would have, but it was fun and accomplished the same goal and gave me some experience in the container platform. 
And it wasn’t that much of a time sink.Was that better for the company? No idea. I hope it was — I hear the tool is still maintained and no one has replaced it with a spreadsheet yet! But that’s a happy accident.Was it better for me? Absolutely! It was a bit selfish, but it made an otherwise tedious task more fun and I learned some useful tricks.So — if you wish you had more time to code… go code a bit more. Don’t let the idea of being more effective guilt you into giving it up. Your career is your career and you should enjoy it.",
            "content_html": "<p>(I wrote this originally as a comment in <a href=\"https://randsinrepose.com/welcome-to-rands-leadership-slack/\">RLS </a>in response to a staff-level engineer who was frustrated at how little they got to code anymore, and it resonated with enough folks that maybe it’s worth sharing here!)</p><p>There’s a trap I’ve seen a lot of staff+ folks fall into where they over-prioritize the idea that they should always be doing “the right, most effective thing for the company”. When I see engineers complain that they don’t get to code enough, I often suspect they’ve fallen prey to this.</p><p>I say <strong><em>that’s a trap</em></strong>! because I see people do this at the expense of their own job satisfaction and growth, which is bad for both them and (eventually) for the company which is likely to lose them.</p><p>I don’t blame people for falling into this trap, it’s what we’re rewarded for. I’ve fallen into it! I have stopped doing technical work I cared about, prioritized #impact, and fought fires wherever they arose. I have spent all my time mentoring and teaching and none coding. The result was often grateful colleagues, but also burnout and leaving jobs I otherwise liked.</p><p>Whereas when I’ve allowed myself to be like 30% selfish — picking some of my work because it was fun and technical, even when doing so was not the “most impactful” thing I could do — I was happier, learned more, and stayed in roles longer.</p><p>An example: I worked on a team that was doing capacity planning poorly and was buying too much hardware. (On-prem, physical hardware.) I could have solved the problem with a spreadsheet, but that was boring and made my soul hurt.</p><p>What I did instead was dig into how our container scheduling platform worked, and wrote a nifty little CLI tool that would look at the team’s configured workloads and spit out a capacity requirement calculation. 
It took about three times as long as the spreadsheet would have, but it was fun and accomplished the same goal and gave me some experience in the container platform. And it wasn’t that much of a time sink.</p><p>Was that better for the company? No idea. I hope it was — I hear the tool is still maintained and no one has replaced it with a spreadsheet yet! But that’s a happy accident.</p><p>Was it better for me? Absolutely! It was a bit selfish, but it made an otherwise tedious task more fun and I learned some useful tricks.</p><p>So — if you wish you had more time to code… go code a bit more. Don’t let the idea of being more effective guilt you into giving it up. Your career is your career and you should enjoy it.</p>",
            "url": "https://hpc.social/personal-blog/2025/the-trap-of-prioritizing-impact/",
            
            
            
            
            
            "date_published": "2025-09-20T14:46:41-06:00",
            "date_modified": "2025-09-20T14:46:41-06:00",
            
                "author": "Thinking Out Loud"
            
        },
    
        {
            "id": "https://hpc.social/personal-blog/2025/lessons-learned-from-three-years-in-cloud-supercomputing/",
            "title": "Lessons learned from three years in cloud supercomputing",
            "summary": null,
            "content_text": "I recently decided to leave Microsoft after having spent just over three years there, first as a storage product manager, then as a compute engineer. Although I touched many parts of Azure's infrastructure during that time, everything I did was at the intersection of large-scale supercomputing and hyperscale cloud. There was no shortage of interesting systems to figure out and problems to solve, but as I began to wrap my arms around the totality of hyperscale AI training in the cloud, I also began to see the grand challenges that lay ahead.Outside Microsoft's Silicon Valley Campus minutes after I was escorted off the premises.Although many of those challenges would probably be fun and exciting to tackle, the more I learned, the more I found myself asking the same two questions: what did I want to do with the rest of my career, and was the path I was following going in the right direction? I spent a lot of time thinking about this, and my decision to leave Microsoft ultimately reflects the answer at which I arrived. But rather than indulge myself by recounting my introspection, I thought I would share some of the things that I learned while at Microsoft in the hopes that others find value in my experience.To that end, I've split this post into two sections:Things I've observed about HPC and technology trends from the perspective of a cloud/hyperscale/AI practitioner and provider, andThings I've realized about jobs and careers from the perspective of someone who's now worked in academia, a successful startup, government, and now Big Tech and is about halfway through his careerI consider this to be the concluding chapter of a three-part series that began with Life and leaving NERSC and continued with How has life after leaving the Labs been going.Also, please note that I authored this the day after my employment at Microsoft ended, and I was not beholden to any company or organization at the time of writing. 
The views expressed below are mine alone.HPCEverything I did at Microsoft touched supercomputers in one way or another, and my day job was exclusively supporting Microsoft's largest AI training supercomputers. Despite that, I did a lot of moonlighting in support of Azure's Federal business, and this is how I justified giving talks at events like NERSC@50, SC, and Salishan in my last year. It's also what let me straddle both worlds: I had rare, first-hand knowledge of how the de facto largest supercomputers in the world were built and used, and I had a front-row seat for how leaders in the traditional supercomputing world perceived (and sometimes misunderstood) what we were doing in the cloud.Before I get into specific observations though, I should clarify some nomenclature that I will use throughout:Supercomputers are the piles of compute nodes with a high-speed interconnect that are designed to solve one big problem in parallel. This is a generic term to describe the instrument, not its workload.HPC, traditional HPC, modsim, and scientific computing all refer to the ecosystem built around using something like MPI to solve a problem rooted in some type of science. Every big supercomputer run by DOE, procured through EuroHPC, and sited at the world-famous, government-funded supercomputer centers falls into this category.Cloud, hyperscale, and AI training all refer to the ecosystem built to train large language models. 
The supercomputers are run by hyperscale companies like Microsoft, Amazon, or Meta whose backgrounds have not historically been in the world of supercomputing.I realize that these are not very precise, but they're the easiest way to contrast what I learned inside Microsoft (a hyperscale cloud) with the world I came from prior (traditional HPC).HPC wants to be like the cloud, not in itWhen I left NERSC in May 2022, I speculated that the future of large-scale supercomputer centers would follow one of two paths:They develop and squish cloud technologies into their supercomputers to make them more cloud-like, orThey abandon the idea of buying individual systems and instead enter into long-term relationships where flagship HPC systems are colocated inside cloud datacenters sited in places with low-cost, low-carbon power.I was hoping that the desire to continue building systems after passing the exascale milestone would make the next click-stop follow path #2, but early indications (across the global HPC landscape) are that the community has chosen path #1.HPC centers around the world are embracing the idea of cloudifying on-prem supercomputers by adding virtualization, containerization, and integration with other services to enable complex workflows. And as a part of that, they're reinventing many of the technology integrations that have always been first-class citizens in cloud: CSCS added capabilities to create \"versatile software-defined clusters\" on their latest Cray system, Alps. NERSC's next system, Doudna, is envisioned to allow its users to \"move from programming the supercomputer to programming the datacenter.\" But none of these systems are actually using commercial cloud services in non-trivial ways.In the year or two that followed ChatGPT, the notion of large-scale supercomputers in the cloud was a green field, and cloud providers were open to chasing all sorts of silly ideas. 
This made it the ideal time for the leadership HPC computing community to get a seat at the hyperscale table. Although their budgets couldn't compete with AI, HPC centers could've drafted on the investments of AI buildout and offered the societal impacts of using GPUs for science as a nice complement to the societal impacts of using GPUs for AI training.Much to my dismay, though, that window of opportunity was spent decrying the investment in hyperscale and AI rather than trying to exploit it; that window was the year of \"us versus them.\" And unfortunately, that window has essentially closed as accountants and CFOs have now sharpened their pencils and are searching for returns on the investments made in GPU infrastructure. The intrinsic value of supercomputing infrastructure in the cloud has been reduced to the point where Microsoft's CEO outright said they were turning away customers who just wanted to pay for GPU clusters, because higher-quality revenue could be made from inferencing services that use those same GPUs.So even if the HPC community woke up tomorrow and realized the long-term benefits of partnering with commercial clouds (instead of trying to copy them), I don't think cloud providers would respond with the same enthusiasm to meet in the middle now as they would have a year or two ago. I don't think this was a deliberate decision on behalf of the cloud providers, and they may not even fully realize this change. But the future of hyperscale supercomputing is rapidly crystallizing, and because HPC wasn't present in the solution, there's no room for it in the final structure.Cloud is expensive, but not for the reasons most thinkIt's been easy to write off the cloud as too expensive for HPC, and most people do silly math based on public list prices for VMs to justify their position. 
The narrative usually goes something like, \"if a single GPU VM costs $40/hr, then running 10,000 of them for five years will cost 17X more than our on-prem supercomputer!\" That's not how it works, and nobody pays that price. That $40/hr is the maximum possible price, and it includes the cost to the cloud provider of keeping nodes idle in the event that someone shows up and suddenly wants to use one on-demand.But even if you cut out all the profit for the cloud provider and just look at the cost of the physical infrastructure, building a supercomputer in the cloud is just more expensive than putting a bunch of whitebox nodes into a traditional HPC datacenter. There's a couple reasons for this, and here are a couple in no particular order:High availability: Every cloud datacenter has redundant power, and most of them have very redundant power. This is provisioned independently of whatever goes inside of that datacenter, so when you deploy a 10 MW supercomputer inside a 10 MW cloud datacenter, that comes with at least 10 MW of backup diesel generators, UPSes, and the electrical infrastructure. HPC workloads don't really need this, but it's hard to deploy HPC in the cloud without a ton of generators and UPSes coming along for the ride. This is changing with AI-specific cloud datacenters now being built, but these AI datacenters still have way more redundant power than a typical on-prem HPC datacenter. Building a cloud datacenter with the minimal redundancy that a traditional HPC datacenter has would mean that facility couldn't ever be used for anything but HPC, and that would undercut the overall flexibility upon which cloud economics are built.Cloud-side infrastructure: Every compute node has to be attached to the frontend cloud network in addition to a backend high-speed network like InfiniBand, unlike a traditional supercomputer where nodes are only attached to one high-speed network. 
While the cost of the smart NIC in each node is just a couple hundred dollars, every cloud supercomputer has to have a complete frontend network built out to support every single compute node--that's a ton of switches, routers, and fiber that must be properly provisioned all the way up to the cloud region in which those nodes are deployed. This frontend network is what enables all the cool cloud features on every node (full SDN, integration with other cloud services, etc), but these features aren't generally worth their cost when running meat-and-potatoes HPC workloads like MPI jobs by themselves. Their value only really shines through when executing complex workflows that, for example, couple an MPI job with stateful services and globally accessible data sharing with fine-grained access controls, all fully automated through programmable APIs and full RBAC.AI-optimized system architecture: AI-optimized GPU supercomputers contain a bunch of components that your typical Cray or Eviden simply wouldn't have. I wrote about the differences between AI and HPC supercomputers elsewhere, but in brief, AI workloads specifically benefit from having tens of terabytes of local SSDs and all-optical (no copper) RDMA fabrics. These add to the COGS (cost of goods sold) of an AI-optimized supercomputer, meaning that a supercomputer with a thousand GPUs designed for AI is going to be more expensive than one designed for scientific computing no matter where it's deployed. And cloud providers are all optimizing their supercomputers for AI.There's a bunch of other cloud \"stuff\" that is required as well; every cloud region has a first footprint which is a LOT of general-purpose servers and storage that is required to support the basic cloud control plane. Before any user-facing cloud resources (including supercomputers) can be deployed, there has to be tens or hundreds of racks of this cloud \"stuff\" that is up and running. 
And although the cost of that first footprint is amortized over many customers in larger or older cloud regions, larger single-use infrastructures (like supercomputers) carry a proportionally larger fraction of the cost to deploy the first footprint.So when you look at the cost of running a single compute node in a cloud supercomputer, there are a bunch of extra ingredients baked in that you wouldn't get by just signing a check over to an OEM:a high availability SLA, afforded in part by all those generators and UPSesslick cloud service integrations, privacy features, virtual networking, afforded by that frontend cloud networkbetter performance for AI training or inferencing workloads, afforded by extra SSDs and all-optical interconnectsa bunch of other typical TCO stuff--the power consumed by the node, the opportunity cost of free floor tiles in your datacenter, and the engineers and technicians that keep it all runningUltimately, someone needs to pay for all of these extra ingredients. Cloud providers could just eat the costs themselves and sell the supercomputing service at a price comparable to what a customer would pay for an on-prem supercomputer--and sometimes they do. But this dilutes the profitability of the deal, and it increases the risks of the cloud provider losing money if unexpected issues arise during execution. Losing money is objectively bad business, so it's usually cloud customers who are paying for all these extra capabilities regardless of if they use them or not.So if all you want to do is run big MPI jobs, and you have no use for the extra availability, cloud integrations, privacy and security, and programmable infrastructure, sure--the price per-node is going to be higher in the cloud than on-prem. You're paying for a bunch of features that you don't need....Although sometimes it isSometimes buying a supercomputer in the cloud is straight up more expensive because of the value it provides though. 
For example, I remember a case where a large AI company needed to train a big LLM on many thousands of GPUs, so they signed an agreement which gave them exclusive access to a cloud supercomputer that strongly resembled a specific GPU system in the DOE complex. Because I used to work in the DOE, I knew how much DOE paid to buy their GPU cluster, and I also knew that three years of maintenance was included in that cost.What amazed me is that this AI company was willing to pay (roughly) the same price that DOE paid for their on-prem supercomputer, but in exchange, get exclusive access to a comparably capable cloud supercomputer (same GPU model, similar GPU count, similar interconnect) for one year only. Put differently, being able to use a big, cutting-edge GPU cluster was worth up to 3x more to this AI company than it was to the DOE.While it may sound like I'm spilling secrets here, the reality is that anyone working for a cloud provider wouldn't be able to tell which AI deal I was describing here--they all look like this, and they're all willing to spend significantly more than the HPC community for the same compute capability. This gives you a sense of the real value that AI companies place on all the benefits that cloud-based supercomputers can provide.This isn't all bad for HPC, though. Every fat deal with an AI company means that there can be another deal with an HPC center that has slim margins. For example, let's say an AI company is willing to pay a billion dollars for a supercomputer whose TCO is only $330M--that means the cloud provider gets 67% margin. If the cloud provider's overall margin target is 50%, that means it can sell an identical supercomputer to an HPC customer at zero profit (for $330M) and still walk away happy. Thus, it is possible for the price of a supercomputer for HPC to be subsidized by all the money that the AI industry is throwing into supercomputing. 
Whether or not a cloud provider ever cuts deals like this is a business decision though--and as I said earlier, I don't think they're as open to silly ideas now as they used to be.The real hurdle that I was never able to overcome, though, is a result of the fact that there is finite expertise in HPC and AI in the world. HPC-AI is ultimately a zero-sum game, and every hour spent working with an HPC customer is usually an hour that isn't being spent working with a much more profitable AI customer. I constantly ran into this problem working in hyperscale AI; my full-time job was to deal with AI customers, but I enjoyed interacting with HPC customers too. As a result, I had to do a lot of my HPC-specific work (preparing conference presentations, for example) on nights, weekends, and vacations. It was just hard to tell people that I couldn't help improve job uptime on a massive training run because I was preparing a talk for a workshop that, frankly, might be openly hostile to my message.Influencing the cloud is hardBecause the difference in investment is so big between HPC and AI, many of the carrots that the HPC community has traditionally dangled in front of HPC vendors aren't very enticing to the hyperscale AI community. For example, both US and European HPC programs have relied heavily on non-recurring engineering (NRE) contracts with industry partners to incentivize the creation of products that are well-suited for scientific computing; PathForward and Horizon 2020 both come to mind as well-funded, successful efforts on this front.However, HPC is the only customer community that really tries to do this, and it echoes a time when the HPC community was at the forefront of scale and innovation. 
Nowadays, the prospect of accepting a $1M/year NRE contract to implement XYZ is completely unappetizing to a hyperscaler; it would probably cost more than $1M/year just to figure out how a company with $250 billion in annual revenue can handle such an unusual type of contract and payment. Add to this the weird intellectual property rules (like disentangling a 40% cost sharing advance waiver for a tiny project within a multi-billion-dollar business), and it can become a corporate quagmire to go anywhere near NRE projects. Companies with well-insulated HPC silos can probably manage this better, but part of hyperscale economics is that everything overlaps with everything else as much as possible across supercomputing, general-purpose computing, hardware, and software.As a result of this, I really struggled to understand how a $20M/year service contract and a $1M/year NRE contract is materially different from a $21M/year service contract in the cloud world. For most (non-HPC) cloud customers, the RFP comes in saying \"we need XYZ\" and some product manager notes customer demand for XYZ. If the demand is large enough, the feature winds up on the roadmap, and the cloud provider develops it as a part of regular business. If there is no other demand, then an NRE contract isn't really going to change that; maintaining feature XYZ long-term will cost far more than a couple million dollars, so implementing it would be a bad decision. This isn't unique to cloud, for what it's worth; while there are some successful HPC NRE stories, there are far more NRE-originated products that had no product-market fit and were simply abandoned after the associated supercomputer was retired.As best as I can tell, NRE has become a way for big HPC customers to maintain the illusion that they are influencing hyperscalers. 
A hyperscaler could propose some NRE, and an HPC buyer could fund it, and there could be weekly meetings where the two get together and pretend like they're collaborating and codesigning. The hyperscaler could write milestone reports, and they could attend quarterly business reviews with the customer. But this feels like an act. You simply can't move a $250B/year company that isn't solely organized around supercomputing with the lure of a couple million a year.This is not to say that NRE and codesign have no purpose in HPC! I'm sure component vendors (GPUs, networking, and the like) can make minor tweaks that offer big upside for the HPC community. But I learned that, as in several other dimensions, the HPC community is being pushed towards buying whatever is already on the truck, and NRE isn't going to have the impact that it once did.CareerIn addition to learning about how the hyperscale supercomputer world works in practice, my time at Microsoft exposed me to a segment of the supercomputing community that I didn't know existed: junior software engineers who were unwittingly thrown into the deep end of HPC straight out of college and were desperate to find their footing in both the technology and their careers overall. Maybe the most impactful work I did in the past three years was not technical at all, but instead came through some internal talks I gave on my professional journey in HPC and the one-on-one conversations that followed.Since I've gotten such positive feedback when I talk and write about this aspect of HPC, I'll also share some things I've learned about choosing the right employer and job during my time at Microsoft.People matterI learned that the right team matters more than the right job. 
It is profoundly important to me that I get to work with people with the same level of passion and curiosity, even if we are working on different problems.In retrospect, I realize that I have been very lucky that my career has progressed through organizations that were packed to the gills with people with whom I shared values. They wanted to go to conferences to share their work, they wanted to hear about how others are solving similar challenges, and they weren't afraid to present (and challenge) new ideas. As I learned over the last three years though, I think these traits are acutely concentrated in the HPC world since HPC itself originated from academia and a culture of independence and self-direction. They certainly aren't universal to all workplaces.To be clear, I am not saying that my coworkers at Microsoft weren't passionate or curious. But I did learn that, at big tech companies, you can have a perfectly successful career by keeping your head down and cranking away at the tasks given to you. If the work changes one day, it's actually a virtue to be able to walk away from the old project and turn your complete attention to a new one. Did the company just cancel the product you've been working on? No problem. If you were good at writing code for Windows update, you'll probably be just fine at coordinating planned maintenances for supercomputers. A colleague of mine called these people \"survivors,\" because they will do the best they can with whatever they're given.While this agility is great if you love programming, it can also engender numbness and dispassion for any specific application area. If a \"survivor\" can just as easily program for HoloLens as they can for GPU telemetry, they also likely don't really care about either HoloLens or GPUs. This isn't a bad thing, and I am certainly not passing judgment on people who don't care about GPUs. 
But it does mean that it's harder for someone who really cares about GPUs to connect with a teammate who really doesn't. And this has many knock-on effects in day-to-day work; it's only natural for people who share common values to help each other out, while relative strangers are less likely to go that extra mile. Finding that common ground to promote \"some person on team X\" to \"my trusted colleague on team X\" is that much harder.This difficulty in finding my community amidst all the survivors is what led me to look outside of my company to find my people. I went to events like the Smoky Mountains Conference and NERSC@50 and took the stage to literally beg the HPC community to give me a reason to work with them. By the letter of my job description, I was never supposed to be on stage; I was supposed to be spending all my time behind my desk, thinking about the reliability of our biggest supercomputers. But I liked working with the people in the HPC community, and I liked working with our HPC sales organization, because we all shared common values; we were passionate about HPC and the mission of advancing scientific computing. So, I wound up spending a lot of time working on simple things with HPC folks and not enough time doing my day job.Company culture matters, tooIn an organization where individuals don't often share a lot of common ground, I learned that it's incumbent upon everyone to make a deliberate effort to maintain a culture of working together and helping each other out. A positive workplace culture won't happen by itself across a massive organization. 
To this end, Satya has a bunch of corporate culture mantras that are often repeated to keep reminding people of the way employees should treat each other.For example, he has a mantra of \"be a learn-it-all, not a know-it-all.\" But I found that many people struggled to really understand how to do this in practice; when confronted with a tough problem (\"your database keeps timing out when we point a thousand nodes at it at once\"), it's often too easy to just be a know-it-all (\"nobody else does that, so you are doing it wrong\") rather than a learn-it-all (\"why are you doing it that way?\"). And the older a company is, the harder it is for decades-long veterans to maintain openness to new challenges in the silo they've built around themselves.I've worked with HPC users for long enough to know that this attitude is pervasive anywhere you put a bunch of smart people with different perspectives into a room. However, it wasn't until I came to Microsoft that I learned that there's something to be gained by explicitly and repeatedly reminding people that they should strive to understand at least as much as they try to explain. Should I ever find myself in a leadership position, this is definitely a mantra I will carry with me and repeat to others, and I will credit my time at Microsoft with appreciating how to really live this mentality, not just parrot it.Being good at things isn't always a jobPeople tell me that I'm pretty good at a bunch of stuff: figuring out how technologies work, explaining complex concepts in understandable ways, and taking a critical look at data and figuring out what's missing. And I enjoy doing these things; this is why I post to my blog, maintain my digital garden, and love getting on stage and giving presentations. 
But people also say that, because I'm good at these things, there'd be no shortage of opportunities for me in the HPC industry should I ever go looking.However, I've learned that a job has to be an amalgamation of responsibilities that create value, and connecting \"things I'm good at\" with \"things that need to be done\" is not always straightforward. For example, if I am good at learning things and share what I learned with others, what kind of jobs actually turn that into a responsibility?Developers don't really do this at all. Their job is really to keep those git commits coming. Sometimes this requires learning new things, but writing blog posts or giving talks is not in the job description, so they don't count for much on performance reviews.Product managers do a little of this. I had to learn a few things and then repeat them a lot when I was a PM. Over and over. To customers, to executives, to partner teams. It was 5% learning and 95% sharing.Salespeople also do a little of this. They have to stay current on customer needs and product features, then repeat them a lot.System architects do a fair amount of this. I had to learn about what technologies are on the horizon, figure out how to piece them into an idea that could be implemented, then explain why it'd all be a good idea to others.Educators do a lot of this. The technology industry is always moving, so learning is required to stay up to date. They also get to be selective about the ideas worth sharing and downplay the rest.Each one of these roles has its own downsides too; for example, product managers and salespeople often have to nag people a lot, which I don't think anyone likes. And many of these roles require sharing knowledge with people who really don't want to hear it. 
After all, what customer is eager to talk to every salesperson who comes in the door?Trying to find the ideal job is not just a matter of being good at many things; it's a matter of finding specific jobs that contain a maximal number of things you're good at and a minimal number of things you don't want to do. It's an NP-hard problem, and I've come to realize that the only way to solve it is through trial-and-error. I'm sure some people get lucky and figure out the optimal path on their first try, but for the rest of us, the only way to approach the optimal path is to continuously reflect and not linger on a known-suboptimal path for any longer than is necessary.I've given up on trying to find the perfect job, because I've learned that it probably doesn't exist. I'm good at some things, I'm bad at some things; I enjoy some responsibilities, and I dislike some responsibilities. As with every other job I've had, I learned a lot about all four of these categories during my time at Microsoft, and my choice of next step has been informed by that. I don't expect it to be perfect, but I have high hopes that it will be a step in the right direction.You don't have to be your employerWhen I left the government for a corporate job, one of my biggest worries was losing credibility with peers whose opinions I respected. It's easy to dismiss the viewpoint of someone at a large vendor with a rationalization like, \"of course they'd say that; it's their job,\" but I learned that the HPC community isn't so reductive. People are smart, and most were willing to engage with the quality of my ideas before checking the affiliation on my conference badge.The trick, of course, was finding ways to share ideas in a way that didn't upset my corporate overlords but had substantive value to my audience. I think I figured this out, and in short, I found that leading with honesty and precision works best. 
The HPC community was built on sharing experiences and learnings about what does and doesn't work, so embracing that--rather than name-dropping products and making hyperbolic claims--seemed to keep me getting invited back to the HPC conferences and workshops that I wanted to attend.I wasn't completely intentional in building whatever credibility I've gained over the last three years, but I was intentional in avoiding work that would clearly compromise it. I never want to be accused of misrepresenting the limits of my understanding, so I will never present a slide containing statements or plots that I can't substantiate. I also never want to be accused of misrepresenting the truth, so I am as forthright as possible in disclosing when I do (or don't) have an incentive to say something.Because I stayed true to myself, I think I was the same person at Microsoft as I was at NERSC or SDSC. That continuity helped my peers quickly recalibrate after I became a vendor, and I think this helped me do more than if I had gone all-in on the role of a cloud spokesperson. Of course, there were times when I had to take on an employer-specific persona, but that's just business, and I've found that peers recognize that this is just a part of the game that we all must play.The result of all this wasn't clear to me until after I started telling people I was leaving Microsoft. There are a bunch of HPC-specific projects I undertook on the side (e.g., reviewing and advising on research, serving on panels), and I started notifying people that I would have to find other Microsoft engineers to take over these obligations since I was leaving. Much to my surprise though, everyone responded the same way: the request to have me help was specifically to me, not my employer. 
Short of any conflicts of interest, they didn't care who employed me and valued my contributions regardless of who was signing my paychecks.So, after three years working for an HPC vendor, I have learned that most people won't define you by your employer as long as you don't define yourself by your employer. It is possible to work for a company that sells HPC and still maintain your own identity as a person, but it requires thoughtful effort and a supportive (or indifferent!) employer. If you act like a company shill, you will be regarded as one, but not many jobs in industry actually require that to fulfill your responsibilities.Happiness sometimes costs moneyI think most people would agree that, while money can't buy happiness, it certainly helps. What I didn't realize until recently, though, is a reciprocal truth: sometimes happiness costs money.A year ago, I wrote about how the pay in industry compares to working at the national labs, and I described how my golden handcuffs were structured. An optimist might say that these vesting schedules are a way to keep a happy employee from being lured away, but I think it's equally common that these are truly handcuffs. They are a constant reminder that, even in the darkest of days, there is a six-figure reason to grit one's teeth and persevere.I've come to realize that there is an inverse correlation between a few factors:Smaller organizations offer more flexibility to mold a job around your preferences, because there is more work scope spread across fewer people.Larger organizations can afford to offer larger total compensation, but flexibility is limited to the scope of any single team.I kind of thought about it like this:When I realized that I should explore other paths, I had to determine where in this continuum I wanted to wind up: do I care more about a fat paycheck, or do I care more about enjoying my day-to-day responsibilities? 
And once offers started coming in, exactly how much of a pay cut was I willing to take in exchange for the flexibility that I would receive?By the time I handed in my resignation at Microsoft, I knew exactly how much this happiness was worth to me. Alternatively, I found out how much opportunity cost I was willing to pay for the ability (hopefully!) to reconnect with my day-to-day work. The calculus was an interesting exercise involving a bunch of Monte Carlo simulation which I won't detail here, but as it turns out, I was willing to pay a lot of money for the chance to do something that aligned more completely with what I wanted to do with the rest of my career. In the end, I gave up hundreds of thousands in unvested stock, and I am taking a six-figure pay cut in annual base pay when I start my next job. For me, though, this was a fair price to pay.Final thoughtsAfter three years in the world of hyperscale supercomputing, I have come away with two major learnings that now shape how I think about the future.On the technical front, I think the HPC community has chosen to keep going its own way and reinvent the cloud rather than work meaningfully with hyperscale cloud providers. There was a brief window of opportunity where the mountain may have actually come to Muhammed, and the trajectory of scientific computing could have fundamentally changed to align with the growth trajectory of hyperscale AI. However, I don't think the HPC community was ready to take a big swing during those early days post-ChatGPT or do an earnest assessment of what that future could've looked like. I also worry that the window has closed, and the HPC community never even realized what was on the table.On the career front, I've realized that success is multidimensional. Money is one axis, but so are impact, people, and purpose. The relative importance of each is not always obvious either; they only became clearer to me as I tried different jobs across the space. 
I've found that the ability to work with like-minded people and the opportunity to learn and share are the most important dimensions to me, but also I recognize that I am privileged in others. Finding stacks of money can be easy for those who work in AI, but there are no shortcuts to building (and retaining!) teams of great people. Anyone who can do the latter well should not be undervalued.There's a lot more that I didn't have time to organize and write, but I have every intention of continuing to be myself, regardless of where I work, in the future. I will keep writing, posting, and talking about what I'm learning in supercomputing whenever I can. And along those lines, I hope that writing all this out helps others figure out what's important to them and where they want to go.",
            "content_html": "<p>I recently decided to leave Microsoft after having spent just over three years there, first as a storage product manager, then as a compute engineer. Although I touched many parts of Azure's infrastructure during that time, everything I did was at the intersection of large-scale supercomputing and hyperscale cloud. There was no shortage of interesting systems to figure out and problems to solve, but as I began to wrap my arms around the totality of hyperscale AI training in the cloud, I also began to see the grand challenges that lay ahead.</p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure><figcaption class=\"image-caption\">Outside Microsoft's Silicon Valley Campus minutes after I was escorted off the premises.</figcaption></figure></div><p>Although many of those challenges would probably be fun and exciting to tackle, the more I learned, the more I found myself asking the same two questions: what did I want to do with the rest of my career, and was the path I was following going in the right direction? I spent a lot of time thinking about this, and my decision to leave Microsoft ultimately reflects the answer at which I arrived. 
But rather than indulge myself by recounting my introspection, I thought I would share some of the things that I learned while at Microsoft in the hopes that others find value in my experience.</p><p>To that end, I've split this post into two sections:</p><ol type=\"1\"><li>Things I've observed about <a href=\"https://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#hpc\"><strong>HPC and technology trends</strong></a> from the perspective of a cloud/hyperscale/AI practitioner and provider, and</li><li>Things I've realized about <a href=\"https://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#career\"><strong>jobs and careers</strong></a> from the perspective of someone who's now worked in <a href=\"https://www.sdsc.edu/\">academia</a>, a <a href=\"https://www.cnbc.com/2019/09/12/10x-genomics-txg-biotech-start-up-surges-in-ipo-debut.html\">successful startup</a>, <a href=\"https://www.nersc.gov/\">government</a>, and now <a href=\"https://www.microsoft.com/\">Big Tech</a> and is about halfway through his career</li></ol><p>I consider this to be the concluding chapter of a three-part series that began with <a href=\"https://blog.glennklockwood.com/2022/05/life-and-leaving-nersc.html\">Life and leaving NERSC</a> and continued with <a href=\"https://blog.glennklockwood.com/2024/08/how-has-life-after-leaving-labs-been.html\">How has life after leaving the Labs been going</a>.</p><p>Also, please note that I authored this the day after my employment at Microsoft ended, and I was not beholden to any company or organization at the time of writing. 
<i>The views expressed below are mine alone</i>.</p><!--<ul><ul><li><a href=\"#hpc\">HPC</a><ul><li><a href=\"#hpc-wants-to-be-like-the-cloud-not-in-it\">HPC wants to be like the cloud, not in it</a></li><li><a href=\"#cloud-is-expensive-but-not-for-the-reasons-most-think\">Cloud is expensive, but not for the reasons most think</a></li><li><a href=\"#although-sometimes-it-is\">...Although sometimes it is</a></li><li><a href=\"#influencing-the-cloud-is-hard\">Influencing the cloud is hard</a></li></ul></li><li><a href=\"#career\">Career</a><ul><li><a href=\"#people-matter\">People matter</a></li><li><a href=\"#company-culture-matters-too\">Company culture matters, too</a></li><li><a href=\"#being-good-at-things-isnt-always-a-job\">Being good at things isn't always a job</a></li><li><a href=\"#you-dont-have-to-be-your-employer\">You don't have to be your employer</a></li><li><a href=\"#happiness-sometimes-costs-money\">Happiness sometimes costs money</a></li></ul></li><li><a href=\"#final-thoughts\">Final thoughts</a></li></ul></ul>--><h2 id=\"hpc\">HPC</h2><p>Everything I did at Microsoft touched supercomputers in one way or another, and my day job was exclusively supporting Microsoft's largest AI training supercomputers. Despite that, I did a lot of moonlighting in support of Azure's Federal business, and this is how I justified giving talks at events like <a href=\"https://sites.google.com/lbl.gov/nersc50-nug/home\">NERSC@50</a>, <a href=\"https://sc24.supercomputing.org\">SC</a>, and <a href=\"https://www.glennklockwood.com/garden/Salishan\">Salishan</a> in my last year. 
It's also what let me straddle both worlds: I had a rare, first-hand knowledge of how the <a href=\"https://www.glennklockwood.com/garden/systems/Eagle\">de facto largest supercomputers in the world</a> were built and used, and I had a front-row seat for how leaders in the traditional supercomputing world perceived (and sometimes misunderstood) what we were doing in the cloud.</p><p>Before I get into specific observations though, I should clarify some nomenclature that I will use throughout:</p><ul><li><strong>Supercomputers</strong> are the piles of compute nodes with a high-speed interconnect that are designed to solve one big problem in parallel. This is a generic term to describe the instrument, not its workload.</li><li><strong>HPC</strong>, <strong>traditional HPC</strong>, <strong>modsim</strong>, and <strong>scientific computing</strong> all refer to the ecosystem built around using something like MPI to solve a problem rooted in some type of science. Every big supercomputer run by DOE, procured through EuroHPC, and sited at the world-famous, government-funded supercomputer centers falls into this category.</li><li><strong>Cloud</strong>, <strong>hyperscale</strong>, and <strong>AI training</strong> all refer to the ecosystem built to train large language models. 
The supercomputers are run by hyperscale companies like Microsoft, Amazon, or Meta whose backgrounds have not historically been in the world of supercomputing.</li></ul><p>I realize that these are not very precise, but they're the easiest way to contrast what I learned inside Microsoft (a hyperscale cloud) with the world I came from prior (traditional HPC).</p><h3 id=\"hpc-wants-to-be-like-the-cloud-not-in-it\">HPC wants to be like the cloud, not in it</h3><p>When I left NERSC in May 2022, <a href=\"https://blog.glennklockwood.com/2022/05/life-and-leaving-nersc.html\">I speculated that the future of large-scale supercomputer centers</a> would follow one of two paths:</p><ol type=\"1\"><li>They develop and squish cloud technologies into their supercomputers to make them more cloud-like, or</li><li>They abandon the idea of buying individual systems and instead enter into long-term relationships where flagship HPC systems are colocated inside cloud datacenters sited in places with low-cost, low-carbon power.</li></ol><p>I was hoping that the desire to continue building systems after passing the exascale milestone would make the next click-stop follow path #2, but early indications (across the global HPC landscape) are that the community has chosen path #1.</p><p>HPC centers around the world are embracing the idea of cloudifying on-prem supercomputers by adding virtualization, containerization, and integration with other services to enable complex workflows. And as a part of that, they're reinventing many of the technology integrations that have always been first-class citizens in cloud: CSCS added capabilities to create <a href=\"https://www.cscs.ch/publications/news/2024/new-research-infrastructure-alps-supercomputer-inaugurated\">\"versatile software-defined clusters\" on their latest Cray system, Alps</a>. 
NERSC's next system, Doudna, is envisioned to allow its users to \"<a href=\"https://www.vastdata.com/sharedeverything/how-nersc-is-rewriting-the-role-of-the-supercomputer\">move from programming the supercomputer to programming the datacenter</a>.\" But none of these systems are actually using commercial cloud services in non-trivial ways.</p><p>In the year or two that followed ChatGPT, the notion of large-scale supercomputers in the cloud was a green field, and cloud providers were open to chasing all sorts of silly ideas. This made it the ideal time for the leadership HPC computing community to get a seat at the hyperscale table. Although their budgets couldn't compete with AI, HPC centers could've drafted on the investments of AI buildout and offered the societal impacts of using GPUs for science as a nice complement to the societal impacts of using GPUs for AI training.</p><p>Much to my dismay, though, that window of opportunity was spent decrying the investment in hyperscale and AI rather than trying to exploit it; that window was the year of \"<a href=\"https://blog.glennklockwood.com/2024/05/isc24-recap.html#section11\">us versus them</a>.\" And unfortunately, that window has essentially closed as accountants and CFOs have now sharpened their pencils and are searching for returns on the investments made in GPU infrastructure. 
The intrinsic value of supercomputing infrastructure in the cloud has been reduced to the point where <a href=\"https://www.theregister.com/2024/10/31/microsoft_q1_fy_2025/\">Microsoft's CEO outright said they were turning away customers who just wanted to pay for GPU clusters</a>, because higher-quality revenue could be made from inferencing services that use those same GPUs.</p><p>So even if the HPC community woke up tomorrow and realized the long-term benefits of partnering with commercial clouds (instead of trying to copy them), I don't think cloud providers would respond with the same enthusiasm to meet in the middle now as they would have a year or two ago. I don't think this was a deliberate decision on behalf of the cloud providers, and they may not even fully realize this change. But the future of hyperscale supercomputing is rapidly crystallizing, and because HPC wasn't present in the solution, there's no room for it in the final structure.</p><h3 id=\"cloud-is-expensive-but-not-for-the-reasons-most-think\">Cloud is expensive, but not for the reasons most think</h3><p>It's been easy to write off the cloud as too expensive for HPC, and most people do silly math based on public list prices for VMs to justify their position. The narrative usually goes something like, \"<a href=\"https://info.ornl.gov/sites/publications/Files/Pub202373.pdf\">if a single GPU VM costs $40/hr, then running 10,000 of them for five years will cost 17X more than our on-prem supercomputer!</a>\" That's not how it works, and nobody pays that price. 
That $40/hr is the maximum possible price, and it includes the cost to the cloud provider of keeping nodes idle in the event that someone shows up and suddenly wants to use one on-demand.</p><p>But even if you cut out all the profit for the cloud provider and just look at the cost of the physical infrastructure, building a supercomputer in the cloud is just more expensive than putting a bunch of whitebox nodes into a traditional HPC datacenter. There's a couple reasons for this, and here are a couple in no particular order:</p><p><strong>High availability</strong>: Every cloud datacenter has redundant power, and most of them have <em>very</em> redundant power. This is provisioned independently of whatever goes inside of that datacenter, so when you deploy a 10 MW supercomputer inside a 10 MW cloud datacenter, that comes with at least 10 MW of backup diesel generators, UPSes, and the electrical infrastructure. HPC workloads don't really need this, but it's hard to deploy HPC in the cloud without a ton of generators and UPSes coming along for the ride. This is changing with AI-specific cloud datacenters now being built, but these AI datacenters still have way more redundant power than a typical on-prem HPC datacenter. Building a cloud datacenter with the minimal redundancy that a traditional HPC datacenter has would mean that facility couldn't ever be used for anything but HPC, and that would undercut the overall flexibility upon which cloud economics are built.</p><p><strong>Cloud-side infrastructure</strong>: Every compute node has to be attached to the frontend cloud network in addition to a backend high-speed network like InfiniBand, unlike a traditional supercomputer where nodes are only attached to one high-speed network. 
While the cost of the smart NIC in each node is just a couple hundred dollars, every cloud supercomputer has to have a complete frontend network built out to support every single compute node--that's a ton of switches, routers, and fiber that must be properly provisioned all the way up to the cloud region in which those nodes are deployed. This frontend network is what enables all the cool cloud features on every node (full SDN, integration with other cloud services, etc), but these features aren't generally worth their cost when running meat-and-potatoes HPC workloads like MPI jobs by themselves. Their value only really shines through when executing complex workflows that, for example, couple an MPI job with stateful services and globally accessible data sharing with fine-grained access controls, all fully automated through programmable APIs and full RBAC.</p><p><strong>AI-optimized system architecture</strong>: AI-optimized GPU supercomputers contain a bunch of components that your typical Cray or Eviden simply wouldn't have. I wrote about the <a href=\"https://www.glennklockwood.com/garden/differences-between-AI-and-HPC\">differences between AI and HPC supercomputers elsewhere</a>, but in brief, AI workloads specifically benefit from having tens of terabytes of local SSDs and all-optical (no copper) RDMA fabrics. These add to the COGS (cost of goods sold) of an AI-optimized supercomputer, meaning that a supercomputer with a thousand GPUs designed for AI is going to be more expensive than one designed for scientific computing no matter where it's deployed. And cloud providers are all optimizing their supercomputers for AI.</p><p>There's a bunch of other cloud \"stuff\" that is required as well; every cloud region has a first footprint which is a LOT of general-purpose servers and storage that is required to support the basic cloud control plane. 
Before any user-facing cloud resources (including supercomputers) can be deployed, there has to be tens or hundreds of racks of this cloud \"stuff\" that is up and running. And although the cost of that first footprint is amortized over many customers in larger or older cloud regions, larger single-use infrastructures (like supercomputers) carry a proportionally larger fraction of the cost to deploy the first footprint.</p><p>So when you look at the cost of running a single compute node in a cloud supercomputer, there are a bunch of extra ingredients baked in that you wouldn't get by just signing a check over to an OEM:</p><ul><li>a high availability SLA, afforded in part by all those generators and UPSes</li><li>slick cloud service integrations, privacy features, virtual networking, afforded by that frontend cloud network</li><li>better performance for AI training or inferencing workloads, afforded by extra SSDs and all-optical interconnects</li><li>a bunch of other typical TCO stuff--the power consumed by the node, the opportunity cost of free floor tiles in your datacenter, and the engineers and technicians that keep it all running</li></ul><p>Ultimately, someone needs to pay for all of these extra ingredients. Cloud providers <em>could</em> just eat the costs themselves and sell the supercomputing service at a price comparable to what a customer would pay for an on-prem supercomputer--and sometimes they do. But this dilutes the profitability of the deal, and it increases the risks of the cloud provider losing money if unexpected issues arise during execution. Losing money is objectively bad business, so it's usually cloud customers who are paying for all these extra capabilities regardless of if they use them or not.</p><p>So if all you want to do is run big MPI jobs, and you have no use for the extra availability, cloud integrations, privacy and security, and programmable infrastructure, sure--the price per-node is going to be higher in the cloud than on-prem. 
You're paying for a bunch of features that you don't need.</p><h3 id=\"although-sometimes-it-is\">...Although sometimes it is</h3><p>Sometimes buying a supercomputer in the cloud is straight up more expensive because of the value it provides though. For example, I remember a case where a large AI company needed to train a big LLM on many thousands of GPUs, so they signed an agreement which gave them exclusive access to a cloud supercomputer that strongly resembled a specific GPU system in the DOE complex. Because I used to work in the DOE, I knew how much DOE paid to buy their GPU cluster, and I also knew that three years of maintenance was included in that cost.</p><p>What amazed me is that this AI company was willing to pay (roughly) the same price that DOE paid for their on-prem supercomputer, but in exchange, get exclusive access to a comparably capable cloud supercomputer (same GPU model, similar GPU count, similar interconnect) for <em>one year only</em>. Put differently, being able to use a big, cutting-edge GPU cluster was worth up to 3x more to this AI company than it was to the DOE.</p><p>While it may sound like I'm spilling secrets here, the reality is that anyone working for a cloud provider wouldn't be able to tell which AI deal I was describing here--they all look like this, and they're all willing to spend significantly more than the HPC community for the same compute capability. This gives you a sense of the real value that AI companies place on all the benefits that cloud-based supercomputers can provide.</p><p>This isn't all bad for HPC, though. Every fat deal with an AI company means that there can be another deal with an HPC center that has slim margins. For example, let's say an AI company is willing to pay a billion dollars for a supercomputer whose TCO is only $330M--that means the cloud provider gets 67% margin. 
If the cloud provider's overall margin target is 50%, that means it can sell an identical supercomputer to an HPC customer at zero profit (for $330M) and still walk away happy. Thus, it is possible for the price of a supercomputer for HPC to be subsidized by all the money that the AI industry is throwing into supercomputing. Whether or not a cloud provider ever cuts deals like this is a business decision though--and as I said earlier, I don't think they're as open to silly ideas now as they used to be.</p><p>The real hurdle that I was never able to overcome, though, is a result of the fact that there is finite expertise in HPC and AI in the world. HPC-AI is ultimately a zero-sum game, and every hour spent working with an HPC customer is usually an hour that isn't being spent working with a much more profitable AI customer. I constantly ran into this problem working in hyperscale AI; my full-time job was to deal with AI customers, but I enjoyed interacting with HPC customers too. As a result, I had to do a lot of my HPC-specific work (preparing conference presentations, for example) on nights, weekends, and vacations. It was just hard to tell people that I couldn't help improve job uptime on a massive training run because I was preparing a talk for a workshop that, frankly, might be openly hostile to my message.</p><h3 id=\"influencing-the-cloud-is-hard\">Influencing the cloud is hard</h3><p>Because the difference in investment is so big between HPC and AI, many of the carrots that the HPC community has traditionally dangled in front of HPC vendors aren't very enticing to the hyperscale AI community. 
For example, both US and European HPC programs have relied heavily on non-recurring engineering (NRE) contracts with industry partners to incentivize the creation of products that are well-suited for scientific computing; <a href=\"https://www.energy.gov/articles/department-energy-awards-six-research-contracts-totaling-258-million-accelerate-us\">PathForward</a> and <a href=\"https://research-and-innovation.ec.europa.eu/funding/funding-opportunities/funding-programmes-and-open-calls/horizon-2020_en\">Horizon 2020</a> both come to mind as well-funded, successful efforts on this front.</p><p>However, HPC is the only customer community that really tries to do this, and it echoes a time when the HPC community was at the forefront of scale and innovation. Nowadays, the prospect of accepting a $1M/year NRE contract to implement XYZ is completely unappetizing to a hyperscaler; it would probably cost more than $1M/year just to figure out how a company with <a href=\"https://www.microsoft.com/investor/reports/ar24/\">$250 billion in annual revenue</a> can handle such an unusual type of contract and payment. Add to this the weird intellectual property rules (like disentangling a <a href=\"https://www.energy.gov/gc/articles/advance-patent-waiver-wa2017-007?utm_source=chatgpt.com\">40% cost sharing advance waiver</a> for a tiny project within a multi-billion-dollar business), and it can become a corporate quagmire to go anywhere near NRE projects. Companies with well-insulated HPC silos can probably manage this better, but part of hyperscale economics is that everything overlaps with everything else as much as possible across supercomputing, general-purpose computing, hardware, and software.</p><p>As a result of this, I really struggled to understand how a $20M/year service contract and a $1M/year NRE contract is materially different from a $21M/year service contract in the cloud world. 
For most (non-HPC) cloud customers, the RFP comes in saying \"we need XYZ\" and some product manager notes customer demand for XYZ. If the demand is large enough, the feature winds up on roadmap, and the cloud provider develops it as a part of regular business. If there is no other demand, then an NRE contract isn't really going to change that; maintaining feature XYZ long-term will cost far more than a couple million dollars, so implementing it would be a bad decision. This isn't unique to cloud, for what it's worth; while there are some successful HPC NRE stories, there are far more NRE-originated products that had no product-market fit and were <a href=\"https://cug.org/proceedings/cug2016_proceedings/includes/files/pap105s2-file1.pdf\">simply abandoned</a> after the associated supercomputer was retired.</p><p>As best as I can tell, NRE has become a way for big HPC customers to maintain the illusion that they are influencing hyperscalers. A hyperscaler could propose some NRE, and an HPC buyer could fund it, and there could be weekly meetings where the two get together and pretend like they're collaborating and codesigning. The hyperscaler could write milestone reports, and they could attend quarterly business reviews with the customer. But this feels like an act. You simply can't move a $250B/year company that isn't solely organized around supercomputing with the lure of a couple million a year.</p><p>This is not to say that NRE and codesign have no purpose in HPC! I'm sure component vendors (GPUs, networking, and the like) can make minor tweaks that offer big upside for the HPC community. 
But I learned that, as in several other dimensions, the HPC community is being pushed towards buying whatever is already on the truck, and NRE isn't going to have the impact that it once did.</p><h2 id=\"career\">Career</h2><p>In addition to learning about how the hyperscale supercomputer world works in practice, my time at Microsoft exposed me to a segment of the supercomputing community that I didn't know existed: junior software engineers who were unwittingly thrown into the deep end of HPC straight out of college and were desperate to find their footing in both the technology and their careers overall. Maybe the most impactful work I did in the past three years was not technical at all, but instead came through some internal talks I gave on my professional journey in HPC and the one-on-one conversations that followed.</p><p>Since I've gotten such positive feedback when I talk and write about this aspect of HPC, I'll also share some things I've learned about choosing the right employer and job during my time at Microsoft.</p><h3 id=\"people-matter\">People matter</h3><p>I learned that the right team matters more than the right job. It is profoundly important to me that I get to work with people with the same level of passion and curiosity, even if we are working on different problems.</p><p>In retrospect, I realize that I have been very lucky that my career has progressed through organizations that were packed to the gills with people with whom I shared values. They wanted to go to conferences to share their work, they wanted to hear about how others are solving similar challenges, and they weren't afraid to present (and challenge) new ideas. As I learned over the last three years though, I think these traits are acutely concentrated in the HPC world since HPC itself originated from academia and a culture of independence and self-direction. 
They certainly aren't universal to all workplaces.</p><p>To be clear, I am not saying that my coworkers at Microsoft weren't passionate or curious. But I did learn that, at big tech companies, you can have a perfectly successful career by keeping your head down and cranking away at the tasks given to you. If the work changes one day, it's actually a virtue to be able to walk away from the old project and turn your complete attention to a new one. Did the company just <a href=\"https://www.theverge.com/2024/10/1/24259369/microsoft-hololens-2-discontinuation-support\">cancel the product you've been working on</a>? No problem. If you were good at writing code for Windows update, you'll probably be just fine at coordinating planned maintenances for supercomputers. A colleague of mine called these people \"survivors,\" because they will do the best they can with whatever they're given.</p><p>While this agility is great if you love programming, it can also engender numbness and dispassion for any specific application area. If a \"survivor\" can just as easily program for HoloLens as they can for GPU telemetry, they also likely don't really <em>care</em> about either HoloLens or GPUs. This isn't a bad thing, and I am certainly not passing judgment on people who don't care about GPUs. But it does mean that it's harder for someone who really cares about GPUs to connect with a teammate who really doesn't. And this has many knock-on effects in day-to-day work; it's only natural for people who share common values to help each other out, while relative strangers are less likely to go that extra mile. Finding that common ground to promote \"some person on team X\" to \"my trusted colleague on team X\" is that much harder.</p><p>This difficulty in finding my community amidst all the survivors is what led me to look outside of my company to find my people. 
I went to events like the <a href=\"https://www.olcf.ornl.gov/tag/smoky-mountain-conference/\">Smoky Mountains Conference</a> and <a href=\"https://sites.google.com/lbl.gov/nersc50-nug/home\">NERSC@50</a> and took the stage to literally beg the HPC community to give me a reason to work with them. By the letter of my job description, I was never supposed to be on stage; I was supposed to be spending all my time behind my desk, thinking about the reliability of our biggest supercomputers. But I liked working with the people in the HPC community, and I liked working with our HPC sales organization, because we all shared common values; we were passionate about HPC and the mission of advancing scientific computing. So, I wound up spending a lot of time working on simple things with HPC folks and not enough time doing my day job.</p><h3 id=\"company-culture-matters-too\">Company culture matters, too</h3><p>In an organization where individuals don't often share a lot of common ground, I learned that it's incumbent upon everyone to make a deliberate effort to maintain a culture of working together and helping each other out. A positive workplace culture won't happen by itself across a massive organization. To this end, Satya has a bunch of corporate culture mantras that are often repeated to keep reminding people of the way employees should treat each other.</p><p>For example, he has a mantra of \"<a href=\"https://www.msn.com/en-us/money/other/how-satya-nadella-created-a-learn-it-all-culture-at-microsoft-to-help-it-become-a-3-trillion-powerhouse/ar-BB1qWoRY\">be a learn-it-all, not a know-it-all</a>.\" But I found that many people struggled to really understand how to do this in practice; when confronted with a tough problem (\"your database keeps timing out when we point a thousand nodes at it at once\"), it's often too easy to just be a know-it-all (\"nobody else does that, so you are doing it wrong\") rather than a learn-it-all (\"why are you doing it that way?\"). 
And the older a company is, the harder it is for decades-long veterans to maintain openness to new challenges in the silo they've built around themselves.</p><p>I've worked with HPC users for long enough to know that this attitude is pervasive anywhere you put a bunch of smart people with different perspectives into a room. However, it wasn't until I came to Microsoft that I learned that there's something to be gained by explicitly and repeatedly reminding people that they should strive to understand at least as much as they try to explain. Should I ever find myself in a leadership position, this is definitely a mantra I will carry with me and repeat to others, and I will credit my time at Microsoft with appreciating how to really live this mentality, not just parrot it.</p><h3 id=\"being-good-at-things-isnt-always-a-job\">Being good at things isn't always a job</h3><p>People tell me that I'm pretty good at a bunch of stuff: figuring out how technologies work, explaining complex concepts in understandable ways, and taking a critical look at data and figuring out what's missing. And I enjoy doing these things; this is why I post to <a href=\"https://blog.glennklockwood.com/\">my blog</a>, maintain <a href=\"https://www.glennklockwood.com/garden/\">my digital garden</a>, and love <a href=\"https://www.youtube.com/playlist?list=PLtPey-3r1oZS0S5pPcWq-L4yrT9-R0gIm\">getting on stage and giving presentations</a>. But people also say that, because I'm good at these things, there'd be no shortage of opportunities for me in the HPC industry should I ever go looking.</p><p>However, I've learned that a <em>job</em> has to be an amalgamation of <em>responsibilities</em> that create value, and connecting \"things I'm good at\" with \"things that need to be done\" is not always straightforward. 
For example, if I am <em>good at</em> learning things and share what I learned with others, what kind of jobs actually turn that into a <em>responsibility</em>?</p><ul><li><strong>Developers</strong> don't really do this at all. Their job is really to keep those git commits coming. Sometimes this requires learning new things, but writing blog posts or giving talks is not in the job description, so they don't count for much on performance reviews.</li><li><strong>Product managers</strong> do a little of this. I had to learn a few things and then repeat them a lot when I was a PM. Over and over. To customers, to executives, to partner teams. It was 5% learning and 95% sharing.</li><li><strong>Salespeople</strong> also do a little of this. They have to stay current on customer needs and product features, then repeat them a lot.</li><li><strong>System architects</strong> do a fair amount of this. I had to learn about what technologies are on the horizon, figure out how to piece them into an idea that could be implemented, then explain why it'd all be a good idea to others.</li><li><strong>Educators</strong> do a lot of this. The technology industry is always moving, so learning is required to stay up to date. They also get to be selective about the ideas worth sharing and downplay the rest.</li></ul><p>Each one of these roles has its own downsides too; for example, product managers and salespeople often have to nag people a lot, which I don't think anyone likes. And many of these roles require sharing knowledge with people who really don't want to hear it. After all, what customer is eager to talk to every salesperson who comes in the door?</p><p>Trying to find the ideal job is not just a matter of being good at many things; it's a matter of finding specific jobs that contain a maximal number of things you're good at and a minimal number of things you don't want to do. 
It's an NP-hard problem, and I've come to realize that the only way to solve it is through trial-and-error. I'm sure some people get lucky and figure out the optimal path on their first try, but for the rest of us, the only way to approach the optimal path is to continuously reflect and not linger on a known-suboptimal path for any longer than is necessary.</p><p>I've given up on trying to find the perfect job, because I've learned that it probably doesn't exist. I'm good at some things, I'm bad at some things; I enjoy some responsibilities, and I dislike some responsibilities. As with every other job I've had, I learned a lot about all four of these categories during my time at Microsoft, and my choice of next step has been informed by that. I don't expect it to be perfect, but I have high hopes that it will be a step in the right direction.</p><h3 id=\"you-dont-have-to-be-your-employer\">You don't have to be your employer</h3><p>When I left the government for a corporate job, one of my biggest worries was losing credibility with peers whose opinions I respected. It's easy to dismiss the viewpoint of someone at a large vendor with a rationalization like, \"of course they'd say that; it's their job,\" but I learned that the HPC community isn't so reductive. People are smart, and most were willing to engage with the quality of my ideas before checking the affiliation on my conference badge.</p><p>The trick, of course, was finding ways to share ideas in a way that didn't upset my corporate overlords but had substantive value to my audience. I think I figured this out, and in short, I found that leading with honesty and precision works best. 
The HPC community was built on sharing experiences and learnings about what does and doesn't work, so embracing that--rather than name-dropping products and making hyperbolic claims--seemed to keep me getting invited back to the HPC conferences and workshops that I wanted to attend.</p><p>I wasn't completely intentional in building whatever credibility I've gained over the last three years, but I was intentional in avoiding work that would clearly compromise it. I never want to be accused of misrepresenting the limits of my understanding, so I will never present a slide containing statements or plots that I can't substantiate. I also never want to be accused of misrepresenting the truth, so I am as forthright as possible in disclosing when I do (or don't) have an incentive to say something.</p><p>Because I stayed true to myself, I think I was the same person at Microsoft as I was at NERSC or SDSC. That continuity helped my peers quickly recalibrate after I became a vendor, and I think this helped me do more than if I had gone all-in on the role of a cloud spokesperson. Of course, there were times when I had to take on an employer-specific persona, but that's just business, and I've found that peers recognize that this is just a part of the game that we all must play.</p><p>The result of all this wasn't clear to me until after I started telling people I was leaving Microsoft. There are a bunch of HPC-specific projects I undertook on the side (e.g., reviewing and advising on research, serving on panels), and I started notifying people that I would have to find other Microsoft engineers to take over these obligations since I was leaving. Much to my surprise though, everyone responded the same way: the request to have me help was specifically to me, not my employer. 
Short of any conflicts of interest, they didn't care who employed me and valued my contributions regardless of who was signing my paychecks.</p><p>So, after three years working for an HPC vendor, I have learned that most people won't define you by your employer as long as you don't define yourself by your employer. It is possible to work for a company that sells HPC and still maintain your own identity as a person, but it requires thoughtful effort and a supportive (or indifferent!) employer. If you act like a company shill, you will be regarded as one, but not many jobs in industry actually <em>require</em> that to fulfill your responsibilities.</p><h3 id=\"happiness-sometimes-costs-money\">Happiness sometimes costs money</h3><p>I think most people would agree that, while money can't buy happiness, it certainly helps. What I didn't realize until recently, though, is a reciprocal truth: sometimes happiness costs money.</p><p>A year ago, I wrote about <a href=\"https://blog.glennklockwood.com/2024/08/how-has-life-after-leaving-labs-been.html#pay-good\">how the pay in industry compares to working at the national labs</a>, and I described how my golden handcuffs were structured. An optimist might say that these vesting schedules are a way to keep a happy employee from being lured away, but I think it's equally common that these are truly handcuffs. 
They are a constant reminder that, even in the darkest of days, there is a six-figure reason to grit one's teeth and persevere.</p><p>I've come to realize that there is an adverse correlation between a few factors:</p><ul><li>Smaller organizations offer more flexibility to mold a job around your preferences, because there is more work scope spread across fewer people.</li><li>Larger organizations can afford to offer larger total compensation, but flexibility is limited to the scope of any single team.</li></ul><p>I kind of thought about it like this:</p><div class=\"separator\" style=\"clear: both; text-align: center;\"></div><p>When I realized that I should explore other paths, I had to determine where in this continuum I wanted to wind up: do I care more about a fat paycheck, or do I care more about enjoying my day-to-day responsibilities? And once offers started coming in, exactly how much of a pay cut was I willing to take in exchange for the flexibility that I would receive?</p><p>By the time I handed in my resignation at Microsoft, I knew exactly how much this happiness was worth to me. Alternatively, I found out how much opportunity cost I was willing to pay for the ability (hopefully!) to reconnect with my day-to-day work. The calculus was an interesting exercise involving a bunch of Monte Carlo simulation which I won't detail here, but as it turns out, I was willing to pay a lot of money for the chance to do something that aligned more completely with what I wanted to do with the rest of my career. In the end, I gave up hundreds of thousands in unvested stock, and I am taking a six-figure pay cut in annual base pay when I start my next job. 
For me, though, this was a fair price to pay.</p><h2 id=\"final-thoughts\">Final thoughts</h2><p>After three years in the world of hyperscale supercomputing, I have come away with two major learnings that now shape how I think about the future.</p><p>On the technical front, I think the HPC community has chosen to keep going its own way and reinvent the cloud rather than work meaningfully with hyperscale cloud providers. There was a brief window of opportunity where <a href=\"https://idiomorigins.org/origin/if-the-mountain-wont-come-to-muhammad-then-muhammed-must-go-to-the-mountain\">the mountain may have actually come to Muhammed</a>, and the trajectory of scientific computing could have fundamentally changed to align with the growth trajectory of hyperscale AI. However, I don't think the HPC community was ready to take a big swing during those early days post-ChatGPT or do an earnest assessment of what that future could've looked like. I also worry that the window has closed, and the HPC community never even realized what was on the table.</p><p>On the career front, I've realized that success is multidimensional. Money is one axis, but so are impact, people, and purpose. The relative importance of each is not always obvious either; they only became clearer to me as I tried different jobs across the space. I've found that the ability to work with like-minded people and the opportunity to learn and share are the most important dimensions to me, but also I recognize that I am privileged in others. Finding stacks of money can be easy for those who work in AI, but there are no shortcuts to building (and retaining!) teams of great people. Anyone who can do the latter well should not be undervalued.</p><p>There's a lot more that I didn't have time to organize and write, but I have every intention of continuing to be myself, regardless of where I work, in the future. I will keep writing, posting, and talking about what I'm learning in supercomputing whenever I can. 
And along those lines, I hope that writing all this out helps others figure out what's important to them and where they want to go.</p>",
            "url": "https://hpc.social/personal-blog/2025/lessons-learned-from-three-years-in-cloud-supercomputing/",
            
            
            
            
            
            "date_published": "2025-07-11T05:26:00-06:00",
            "date_modified": "2025-07-11T05:26:00-06:00",
            
                "author": "Glenn K. Lockwood's Blog"
            
        },
    
        {
            "id": "https://hpc.social/personal-blog/2025/isc-25-recap/",
            "title": "ISC'25 recap",
            "summary": null,
            "content_text": "I had the pleasure of attending the 40th annual ISC High Performance conference this month in Hamburg, Germany. It was a delightful way to take the pulse of the high-performance computing community and hear what the top minds in the field are thinking about.The main foyer of Congress Center Hamburg, and the view that greeted me on the first morning of ISC'25. The conference felt a little quieter than usual this year, and there didn't seem to be as many big ideas and bold claims as in years past. There was a new Top 10 system announced, but it was built using previous-generation Hopper GPUs. There were a record number of exhibitors, but many of the big ones (Intel, AMD; the big three cloud providers) were all absent. And while there were some exciting new technologies (like AMD MI350-series GPUs and Ultra Ethernet v1.0) debuting during the week, they actually debuted elsewhere and were simply referenced throughout the week's talks.This year's ISC really felt like the place where the big news of the industry was being repeated in the context of scientific computing instead of being stated for the first time. And maybe this is the future of HPC conferences: rather than being where new technology is announced, perhaps ISC will become where the scientific community tries to figure out how they can use others' technology to solve problems. That idea--figuring out how to make use of whatever the AI industry is releasing--was certainly pervasive throughout the ISC program this year. 
The conference's theme of \"connecting the dots\" felt very appropriate as a result; rather than defining new dots, the conference was all about trying to make sense of the dots that have already been drawn.I took plenty of notes to try to keep track of everything that was being discussed, and as has become tradition, I've tried to summarize some of the key themes in this post.Table of contentsZettascaleOzaki, Ozaki, OzakiTop500JUPITERHPC-AI system intersectionOther new entrantsHPC around the worldHPC in ChinaElsewhere in AsiaThe Middle EastExhibitorsCloud, or lack thereofParting thoughtsZettascaleNow that exascale is squarely in the rear-view mirror of HPC, an increasing number of high-profile speakers began pushing on zettascale as the next major milestone. Like the early days of exascale, most of the discourse was less about what can be achieved with zettascale and more about the technology challenges that need to be surmounted for HPC to continue moving forward. And to that end, using zettascale to justify tackling big hardware and software challenges wasn't a bad thing, but it felt like every talk about zettascale this year was still more fanciful than anything else.The opening keynote, \"HPC and AI - A Path Towards Sustainable Innovation\" was delivered by a duo of CTOs: Mark Papermaster (of AMD) and Scott Atchley (of Oak Ridge Leadership Computing Facility). It was a textbook keynote: it had inspiring plots going up and to the right that showed huge potential! It had scary linear extrapolations showing that staying the course won't do! It had amazing science results enabled by big iron! It even had a surprise product debut in MI355X! ChatGPT couldn't have come up with a better structure for a keynote presentation. 
But as is my wont, I listened to the talk with a little skepticism and found myself raising an eyebrow a few times.A part of Papermaster's presentation involved an extrapolation to zettascale by 2035 and claimed that HPC is approaching an \"energy wall:\"Extrapolating ten years on a semilog plot is a great way to cause alarm in people who don't pay close attention to axes.He specifically said that we'd need 1 GW per supercomputer to reach zettascale by 2035 on the current trajectory. He then used this to motivate \"holistic co-design\" as the only way to reach zettascale, and he went on to talk about all the same things we heard about leading up to exascale: increase locality and integration to reduce power and increase performance.While I agree that we should aspire to do better than a gigawatt datacenter, this notion that there is an \"energy wall\" that stands between us and zettascale is a bit farcical; there's nothing special about a 1 GW zettascale supercomputer, just like there was nothing special about 20 MW for exascale. You might argue that building a supercomputer that consumes all the power of a nuclear reactor might be fundamentally more difficult than one that consumes only 20 MW, and you'd be right--which is why the first gigawatt supercomputers probably aren't going to look like the supercomputers of today.Papermaster's \"energy wall\" slide reminded me of the great horse manure crisis of 1894, where people extrapolated from today using an evolutionary, not revolutionary, trajectory. If building a single gigawatt supercomputer is inconceivable, then build four 250 MW supercomputers and put a really fast network between them to support a single, synchronous job. 
The AI industry is already headed down this road; Google, Microsoft, and OpenAI have already talked about how they synchronously train across multiple supercomputers, and Microsoft announced their 400 Tb/s \"AI WAN\" for this last month as a means to enabling wide-area training.Granted, it's unlikely that the HPC community will be building massive, distributed supercomputers the way hyperscale is. But I was disappointed that the keynote only went as far as saying \"a gigawatt supercomputer is crazy, so we need codesign at the node/rack scale.\" Codesign to reach zettascale will probably require a whole new approach that, for example, accounts for algorithms that synchronize communication across multiple datacenters and power plants. The infrastructure for that is already forming, with the US developing its Integrated Research Infrastructure (IRI) and Europe shaping up to have over a dozen AI factories. Zettascale by 2035 may very well exist for the scientific computing community, but it'll probably look a lot more like hyperscale zettascale rather than a single massive building. A single machine plugged into a gigawatt nuclear reactor only happens if business-as-usual is extrapolated out another ten years as Papermaster did, and the codesign required to achieve that isn't very meaningful.Prof. Satoshi Matsuoka also gave a talk on the big stage about Fugaku-NEXT, which Japan has branded as a zettascale system. His vision, which will be realized before 2030, aims to deploy a single, 40 MW supercomputer (much like Fugaku was) where:10x-20x speedup comes from hardware improvements2x-8x speedup comes from mixed precision or emulation (more on this below)10x-25x speedup comes from surrogate models or physics-informed neural networksThe net result is a 200x-4000x speedup over Fugaku. His rationale is that this will result in a system that is effectively equivalent to somewhere between 88 EF and 1.7 ZF FP64. 
It's not literally doing that many calculations per second, but the science outcomes are equivalent to a brute-force approach using a much larger system.I thought this approach to reaching zettascale was much more realistic than the Papermaster approach, but it does require the scientific computing community to redefine its metrics of success. If HPL was a bad benchmark for exascale, it is irrelevant to zettascale since it's unlikely that anyone will ever run HPL on a zettascale system. At best, we'll probably see something like HPL-MxP that captures the 10x-20x hardware speedup and the 2x-8x mixed-precision or emulated FP64 speedup to reach hundreds of exaflops, but the 10x-25x from surrogate models will be domain-specific and defy simplistic ranking. If I had to guess, the first zettascale systems will be benchmarked through Gordon Bell prize papers that say things like \"simulating this result using conventional FP64 would have required over 1 ZF for 24 hours.\"Ozaki, Ozaki, OzakiAlthough Prof. Matsuoka evoked the 2x-8x speedup from mixed precision or emulation when claiming Fugaku-NEXT would be zettascale, he was far from the only speaker to talk about mixed precision and emulation. In fact, it seemed like everyone wanted to talk about emulating FP64, specifically using NVIDIA's low-precision tensor cores and the Ozaki scheme (or its derivatives). By the end of the week, I was absolutely sick of hearing about Ozaki.For the unindoctrinated, this Ozaki scheme (and similar methods with less-catchy names) is a way to emulate matrix-matrix multiplications at high precision using low-precision matrix operations. It's become so hot because, despite requiring more arithmetic operations than a DGEMM implemented using WMMA/MFMA instructions, it can crank out a ton of FP64-equivalent operations per unit time. 
This is a result of the ridiculously nonlinear increases in throughput of low-precision tensor/matrix cores on modern GPUs; for example, Blackwell GPUs can perform over 100x more 8-bit ops than 64-bit ops despite being only 8x smaller. As a result, you can burn a ton of 8-bit ops to emulate a single 64-bit matrix operation and still realize a significant net speedup over hardware-native FP64. Matsuoka presented the following slide to illustrate that:Dr. Uchino's estimates of how many FP64 FLOPS one can emulate using INT8 as presented by Satoshi Matsuoka.Emulation offers a way for scientific apps that need high-precision arithmetic to directly use AI-optimized accelerators that lack FP64 in hardware, so it's worth talking about at conferences like ISC. But it seems like everyone wanted to name-drop Ozaki, and the actual discussion around emulation was generally a rehash of content presented earlier in the year at conferences like GTC25.While hearing about FP64 emulation and Ozaki schemes got tiring throughout the week, I had to remind myself that I hadn't even heard about Ozaki before September 2024 at the Smoky Mountains Conference. The fact that the Ozaki scheme went from relative algorithmic obscurity to being the star of the show in nine months is either a reflection of its incredible importance in scientific computing or a testament to the reach of NVIDIA's marketing.Cynically, I'll bet that NVIDIA is probably doing everything it can to make sure the world knows about the Ozaki scheme, and ISC was a part of that. When the datasheets for Rubin GPUs are released, I'll bet the performance table has a row claiming a bazillion FP64 FLOPS, and there will be a tiny footnote that clarifies they're citing emulated FP64 precision. 
They did it with structured sparsity, and I'm sure they'll do it for emulated DGEMM.Although the Ozaki scheme is perhaps over-hyped considering how narrow its applicability is to the broad range of compute primitives used in scientific computing, I do anticipate that it is the tip of the iceberg. If 2025 was the year of the Ozaki scheme, 2026 may be the year of the emulated FP64 version of FFTs, sparse solvers, stencils, or other key algorithms. We're seeing signs of that already; David Keyes and Hatem Ltaief both presented material at ISC on using mixed-precision matrix operations for other scientific problems, and I mentioned their work in my earlier GTC25 blog. I'm not sure \"the Keyes scheme\" or \"the Ltaief scheme\" is as catchy as \"the Ozaki scheme,\" but I expect to hear more about these other emulation techniques before ISC26.Top500On the topic of matrix-matrix multiplication, I can't get too much farther without talking about the Top500 list released at ISC. Although there was no new #1 system, Europe's first exascale system, JUPITER, made its sub-exascale debut. There were also a number of new entries in Top50, and surprisingly, many of them came from companies who offer GPUs-as-a-Service for AI training rather than the usual public-sector sites delivering cycles for scientific research. However, all the new entries were still using previous-generation Hopper GPUs despite huge Blackwell systems coming online, exposing a perceptible lag between the state of the art in supercomputers for AI and traditional HPC.As with last year, I felt a growing tension between what the Top500 list brings to the discussion and where the large-scale supercomputing industry is headed. As I wrote earlier, mixed-precision and emulated FP64 was a hot topic in the technical program, but the emphasis of the Top500 session was still squarely on bulk-synchronous FP64 performance. 
HPL-MxP awards were handed out, but they all wound up in the hands of systems who were also at the top of the regular HPL list. Nobody is submitting HPL-MxP-only scores, and there was no meaningful discussion about the role that the Ozaki scheme will play going forward in Top500's future.Opining about the long-term future of the Top500 list is a whole separate blog post though, so I'll focus more on what was covered at this year's session.JUPITERJUPITER was the only new entrant into the Top 10, and it posted at #4 with an average 793 PF over a hundred-minute run. Though it hasn't broken the 1 EF barrier yet, JUPITER is noteworthy for a few reasons:It is expected to be Europe's first exascale system. Given this HPL run used only 79% of the Booster Module's 5,884 GH200 nodes, some basic extrapolation puts the full-system run just a hair above 1 EF. Jülich will either have to run with 100% node availability or get a few extra nodes to exceed 1 EF though.JUPITER is also now the biggest NVIDIA-based supercomputer on Top500, pushing Microsoft's H100 SXM5 system (Eagle) down to #5. JUPITER is also Eviden's biggest system and a strong affirmation that Europe isn't dependent on HPE/Cray to deliver on-prem systems of this scale.JUPITER was also installed into a modular datacenter, an approach that is emerging as a preferred method for rapidly deploying large GPU systems in Europe. This setup allowed Jülich to place shipping container-like modules on a concrete foundation in just a few months. However, because the datacenter is form-fit to the JUPITER system without much extra space, it's impossible to take a glamor shot of the entire machine from far away. As a result, most photos of JUPITER show only the datacenter modules that wrap the supercomputer racks. For example, Prof. 
Thomas Lippert shared this photo of JUPITER during his presentation:JUPITER's modular datacenter as seen from a drone flying overhead.As Lippert was describing JUPITER, I couldn't help but compare it to the AI supercomputers I support at my day job. Like JUPITER, our supercomputers (like Eagle) aren't very photogenic because they're crammed into form-fitted buildings, and they are best photographed from the sky rather than the ground. For example, here's a photo of one of Microsoft's big GB200 supercomputers that I presented later in the week:A slide showing one of Microsoft's big GB200 supercomputers that I presented at the SuperCompCloud workshop later in the week. The big two-story building in the center houses GPUs, and the long white building on the right houses storage and CPU-only nodes.JUPITER may be the first exascale system listed on Top500 that doesn't have fancy rack graphics, but I don't think it will be the last.I also found myself wondering if these modular datacenters are trading short-term upsides with long-term downsides. While they accelerate deployment time for one-off supercomputers, it wasn't clear to me if these modular structures are reusable. Does the entire datacenter retire along with JUPITER after 5-7 years?Hyperscalers use modular datacenters too, but the modularity is more coarse-grained to support a wider variety of systems over multiple decades. They're also physically more capacious, allowing them to deploy more CDUs and transformers per rack or row to retrofit them for whatever power and cooling demands evolve into over the full depreciation life of the datacenter building.HPC-AI system intersectionAs with last year, Erich Strohmeier did a walkthrough of Top500 highlights, and he argued that \"hyperscale\" is defined as anything bigger than 50 MW, and therefore the Top500 list is hyperscale. 
It wasn't clear what value there was in trying to tie the Top500 list to hyperscale in this way, but there were a few ways in which Top500 is beginning to intersect with hyperscale AI.Foremost is the way in which some exascale systems have been appearing on the list: they first appear after HPL is run on a big but partially deployed machine, then six months later, the full-system run is listed. Aurora and JUPITER both follow this pattern. What's not obvious is that many massive AI supercomputers also do something like this; for example, the Eagle system's 561 PF run was analogous to Aurora's initial 585 PF run or JUPITER's 793 PF run. The difference is that systems like Eagle typically enter production training after that first big tranche of GPUs is online, so there is never an opportunity to run HPL as more of the system powers up. Instead, the production training job simply expands to consume all the new GPUs as new tranches come online.This iteration of the Top500 list also saw a number of bona fide commercial AI training clusters from smaller GPU-as-a-Service and \"AI factory\" providers post results, giving the public a view of what these systems actually look like:Nebius listed ISEG2 at #13 with a 624-node, 202 PF H200 SXM cluster, following their 2023 Top500 debut with a 190-node, 46 PF H100 SXM cluster. Nebius was spun out of Yandex, the Russian tech conglomerate.Northern Data Group debuted Njoerd at #26 with a 244-node H100 SXM cluster. Northern Data Group started out as a German bitcoin mining company.FPT debuted at #36 with a 127-node H200 SXM cluster and #38 with a 127-node H100 SXM cluster. FPT is a Vietnamese technology conglomerate.It's notable that none of these systems resemble the sovereign AI systems or EuroHPC AI Factories cropping up in Europe, which are attached to traditional HPC centers and built on familiar HPC platforms like Cray EX or BullSequana. 
Rather, they're essentially NVIDIA reference architectures that resemble DGX SuperPods but are stamped out by companies like Supermicro, Gigabyte, and ASUS.While it's nice of these GPU-as-a-Service companies to participate in the Top500 list, I did not see anyone from these companies in the technical program in any other way. And I did not see anyone from the bigger GPU-as-a-Service providers (CoreWeave, Crusoe, Lambda, etc) contributing either. Thus, while these companies are participating in Top500, it doesn't seem like they're genuinely interested in being a part of the HPC community.Other new entrantsIf you take a step back and look at the ten largest systems that made their debut at ISC'25, they broadly divide into two categories. Here's the list:RankSystemPlatformSite4JUPITER BoosterGH200Jülich11Isambard-AI phase 2GH200Bristol13ISEG2H200 SXM5Nebius15ABCI 3.0H200 SXM5AIST17Discovery 6GH200ExxonMobil18SSC-24H100 SXM5Samsung26NjoerdH100 SXM5Northern Data Group27ABCI-QH100 SXM5AIST33AI-03MI210Core4236FPT AI Factory JapanH200 SXM5FPTAside from Core42's weird MI210 cluster, every new big system was either GH200 (for traditional HPC) or H100/H200 SXM5 (for AI). This suggests a few interesting things:None of the AI cloud/GPUaaS providers are talking about GH200. It seems that GH200 is squarely for scientific computing, and Hopper HGX systems are preferred for AI at scale.Despite debuting on Top500 two years ago, H100 is still making its way into the hands of HPC and AI sites. This could mean one of several things:H100 is more affordable now (Jensen says he can't give them away),there was a huge backlog of H100 orders, orit's just taking some places a really long time to get H100 up and runningBlackwell is not relevant to HPC right now. There are no big Blackwell systems on this list, nor was Blackwell discussed in any sessions I attended during the week. This is despite large GB200 systems being public, up, and benchmarked. 
For example, CoreWeave, IBM, and NVIDIA ran MLPerf Training across 39 racks (624 nodes) of a GB200 NVL72 system named Carina just last month. They did not appear to bother with HPL, though.From all this, it seems like there is a definite lag forming between what qualifies as \"leadership computing\" to HPC people and AI people. Today's leadership HPC (Hopper GPUs) is yesterday's leadership AI, and today's leadership AI (Blackwell GPUs) isn't on the radar of leadership HPC yet. Maybe GB200 will begin appearing one or two years later as the AI people move on to Vera-Rubin.So, if I had to guess, I think the top-end of Top500 in 2027 could look like one of three things:It will contain HPC systems with state-of-the-art, HPC-specific variants of accelerators that are completely irrelevant to AI. Large AI training systems will simply disappear from the list, because HPL has ceased to be a meaningful measure of their capability. GB200/GB300 simply never appear on Top500.It will contain HPC systems with previous-generation Blackwell accelerators after Jensen (the chief revenue destroyer) gets on stage and tells the world that Blackwell is junk because Rubin is awesome. The AI industry gobbles up all the Rubin GPUs, and HPC picks up the scraps they leave behind.Top500 starts allowing FP64 emulation, and all bets are off on how ridiculous the top systems' numbers look. In this case, top systems just skip the 1-10 exaflops range and start debuting at tens of exaflops.I have no idea where things will go, but we're starting to see big HPC deals targeting Vera Rubin that line up with the same time Rubin will land for the AI industry in 2H2026. So maybe Blackwell is just a hiccup, and option #1 is the most likely outcome.HPC around the worldThough Blackwell's absence from Top500 was easy to overlook, China's continued absence was much more obvious. 
Even though no new Chinese systems have been listed in a few years now, representatives from several Chinese supercomputing centers still contributed invited talks throughout the week. In that context, I appreciated how fully ISC embraces its international scope. I found myself attending a lot of \"HPC Around the World\" track sessions this year, partly because I work for a multinational corporation and have to stay aware of potential needs outside of the usual US landscape. But there's also been a sharp rise in the amount of serious HPC that is now occurring outside of the USA under the banner of \"sovereign AI,\" and I've been keen to understand how \"sovereign AI\" compares to the US-based AI infrastructure in which I work. Before getting too deep into that though, China is worth discussing on its own since they had such a prominent presence in the ISC program this year.\n\nHPC in China\n\nFollowing the single-track opening keynote on the first day of ISC is the single-track Jack Dongarra Early Career Award Lecture, and this year's talk was given by awardee Prof. Lin Gan from Tsinghua University. In addition, Dr. Yutong Lu gave two separate talks--including the closing keynote--which shed light on the similarities and differences between how China and the US/Europe are tackling the challenges of exascale and beyond. China is in a position where it does not have access to US-made GPUs, forcing them to develop their own home-grown processors and accelerators to meet their needs for leadership computing. As a result, both speakers gave talks that (refreshingly) revolved around non-GPU technologies as the basis for exascale supercomputers. Although neither Gan nor Lu revealed anything that wasn't already written about in the Gordon Bell prize papers, I took away a few noteworthy observations: The most public Chinese exascale system is always called the \"New Sunway\" or \"Next Generation Sunway,\" never \"OceanLight\" as has been reported in western media. 
There still aren't any photos of the machine either, and Dr. Gan used stock diagrams of the predecessor Sunway TaihuLight to represent New Sunway. There was no mention of the Tianhe Xingyi/TH-3 supercomputer at all. Chinese leadership computing details remain deliberately obfuscated despite the openness to present at ISC. For example, Lu presented the following English-language table from the 2024 China Top100 HPC list:\n\nNo. | Vendor | System | Site | Year | Application | CPU Cores | Linpack (Tflops) | Peak (Tflops) | Efficiency (%)\n1 | Server Provider | Supercomputing system mainframe system, heterogeneous many-core processor | Supercomputing Center | 2023 | computing service | 15,974,400 | 487,540 | 620,000 | 78.7\n2 | Server Provider | Internet Company Mainframe System, CPU+GPU heterogeneous many-core processor | Internet company | 2022 | computing service | 460,000 | 208,260 | 390,000 | 53.4\n3 | Server Provider | Internet Company Mainframe System, CPU+GPU heterogeneous many-core processor | Internet company | 2021 | computing service | 285,000 | 125,040 | 240,000 | 52.1\n4 | NRCPC | Sunway TaihuLight, 40960*Sunway SW26010 260C 1.45GHz, customized interconnection | NSCC-WX | 2016 | supercomputing center | 10,649,600 | 93,015 | 125,436 | 74.2\n5 | Server Provider | Internet Company Mainframe System, CPU+GPU heterogeneous many-core processor | Internet company | 2021 | computing service | 190,000 | 87,040 | 160,000 | 51.2\n6 | NUDT | Tianhe-2A, TH-IVB-MTX Cluster + 35584*Intel Xeon E5-2692v2 12C 2.2GHz + 35584 Matrix-2000, TH Express-2 | NSCC-GZ | 2017 | supercomputing center | 427,008 | 61,445 | 100,679 | 61.0\n7 | Server Provider | Internet Company Mainframe System, CPU+GPU heterogeneous many-core processor | Internet company | 2021 | computing service | 120,000 | 55,880 | 110,000 | 50.8\n8 | Server Provider | ShenweiJing Supercomputer System, 1024*SW26010Pro heterogeneous many-core processor 390C MPE 2.1 GHz | Computing Company | 2022 | scientific computing | 399,360 | 12,912 | 14,362 | 89.9\n9 | Server Provider | Supercomputing Center System, 992*SW26010Pro heterogeneous many-core processor 390C MPE 2.1 GHz | Supercomputing Center | 2021 | scientific computing | 386,880 | 12,569 | 13,913.0 | 90.3\n10 | BSCCC/Intel | BSCCC T6 Section 5360*Intel Xeon Platinum 9242 homogeneous many-core processor 48C 2.3 GHz, EDR | BSCCC | 2021 | computing service | 257,280 | 10,837 | 18,935.0 | 57.2\n\nThe #1 system is almost definitely built on SW26010P processors just like the big New Sunway system that Gan discussed (15,974,400 cores / 390 cores per SW26010P = 40,960 nodes), but it's significantly smaller than the 39M cores on which the work Gan highlighted was run. Clearly, China's biggest systems aren't on their own Top100 list, and their #1 listed system only says its processors are \"heterogeneous many-core\" despite smaller entries explicitly listing SW26010P (Pro) processors. Chinese leadership computing struggles aren't being hidden. Lu specifically called out a \"lack of a new system\" in 2024, echoing earlier sentiments from other leaders in Chinese HPC who have referred to \"some difficulties in recent years\" and a \"cold winter\" of HPC. She also said that their leadership systems are \"relatively\" stable rather than trying to overstate the greatness of Chinese HPC technology. But as above, she didn't get into specifics; by comparison, Scott Atchley (of Oak Ridge Leadership Computing Facility) specifically quoted a 10-12 hour mean time between job interrupt on Frontier after his keynote. Whether 10-12 hours is \"relatively stable\" remained unspoken. Performance portability wasn't a top-line concern despite how hard it seems to port applications to Chinese accelerators. SW26010P is weird in that it has a host core and offload cores with scratchpads, and its native programming model (Athread) is very CUDA-like as a result. Gan made it seem that China is investing a lot of effort into \"fine-grained optimizations\" using OpenACC and Athread, and he showed all the ways in which they're rewriting a lot of the kernels and decompositions in complex applications (like CAM) to make this work. 
This sounds like a performance portability nightmare, yet there wasn't much talk about Chinese equivalents to performance portability frameworks like Kokkos, RAJA, or alpaka. Lu did name-drop a few frameworks that unify HPC and AI performance portability from around the world:\n\nYutong Lu's only reference to software that enhances portability and productivity. Not quite the same as what Kokkos, RAJA, and alpaka aim to solve, though.\n\nHowever, these were more about aligning efforts across scientific computing and AI rather than enabling scientific apps to run seamlessly across China's different exascale accelerators. Application focus areas in China seem similar to everywhere else. Classical and quantum materials modeling, climate and ocean modeling, electronic structure calculations, and genomics were all mentioned by Gan and Lu in their talks. There was no mention of stockpile stewardship or any defense-related applications of HPC, though I'm sure China is using big supercomputers in these efforts just as US and European nations do. The only unusual application that I noticed was Gan's mention of implementing reverse time migration (RTM) on FPGAs; I've only ever heard of RTM in the context of oil exploration. Though I'm no expert, I didn't think many HPC centers spent a lot of time focusing on that technique. I do know KAUST has done some work optimizing RTM applications with Aramco in this space, but most other national supercomputing centers keep oil and gas at arm's length. Gan's RTM work may be related to earthquake modeling rather than petroleum, but it stood out nonetheless. Nobody talked about GPUs. Gan spent a healthy amount of time talking about applying FPGAs and NPUs to scientific problems, but these are areas of research that are on the fringes of mainstream HPC. 
I'm not sure if this reflected his own interests or priority research directions in China, but given that Chinese researchers cannot procure NVIDIA or AMD GPUs, perhaps FPGAs and NPUs are being pursued as a potential next-best-thing. Necessity truly is the mother of invention, and China might be the driver of a disproportionate amount of innovation around dataflow processing and reduced precision for modeling and simulation workloads.Nobody talked about storage either. I'm not sure if this suggests China has a lopsided interest in compute over holistic system design, or if they just talked about their biggest challenges (which are using home-grown accelerators productively). Granted, keynote speakers rarely talk about storage, but I didn't see much participation from China in any of the subsystem-specific sessions I attended either. This is particularly notable since, for a time, Chinese research labs were dominating the IO500 list with their home-made file systems. Networking was mentioned in passing in Lu's closing keynote, but not much beyond another example of technology fragmentation, and there were no specific Chinese interconnects being discussed during the week.China is in the thick of AI just like the rest of the world. Lu said that 30% of the cycles on their big HPC systems go to AI, which is right in line with anecdotes from other HPC sites that put their figures at somewhere up to 50%. She also presented the Chinese taxonomy of the three ways in which AI and scientific computing can mesh together: HPC for AI (training LLMs on supercomputers), HPC by AI (AI for system design and operations), and HPC and AI (AI in the loop with simulation). China is also neck-deep in figuring out how to exploit reduced precision (or \"intelligent computing,\" as Lu branded it) and has pivoted from being \"performance driven\" (which I took to mean HPL-driven) to \"target driven\" (which I took to mean scientific outcome-driven). 
This is consistent with their recent Gordon Bell prize win and non-participation in either Top500 or China Top100.China is embracing geo-distributed supercomputing and complex workflows, much like the US. Lu specifically called out \"Computility Net,\" a catchy name that sounded a lot like the US DOE's Integrated Research Infrastructure (IRI). She described it as a national effort to combine supercomputing with \"commodity IT\" resources (perhaps Chinese cloud?) to enable \"resource sharing\" through a \"service grid.\" In her closing keynote, she even name-dropped IRI:The Chinese vision for Computility Net, which seems analogous to the US Integrated Research Infrastructure, as presented by Yutong Lu.She did liken Computility to both IRI in the US and PRACE in the EU though, and in my mind, PRACE is nothing like IRI. Rather, PRACE is more like TeraGrid/XSEDE/ACCESS in that it federates access to HPC systems across different institutions, whereas IRI's ambition is to tightly integrate computational and experimental facilities around the country. But from the above slide, it sounds like Computility Net is closer to IRI since it is coupled to \"Supercomputing internet\" (akin to ESnet?) and bridging compute and data across eastern and western China.Elsewhere in AsiaAlthough Chinese researchers headlined a few sessions at ISC, a number of other Asian nations presented their national supercomputing strategies as well. Japan and Korea have mature, world-class HPC programs, but I was surprised to see how ambitious India has become to catch up. Smaller nations were also represented, but it was clear to me that their focus is spread across midrange HPC, partnering with large centers in Korea/Japan, and innovating around the edges of supercomputing. 
And perhaps unsurprisingly, every nation represented had a story around both quantum computing and artificial intelligence regardless of how modest their production modsim infrastructure was. India appears to be rapidly catching up to the US, Europe, and Japan much in the same way China was fifteen years ago. Representatives from C-DAC, the R&D organization that owns the national supercomputing mission in India, gave a far-reaching presentation about India's ambition to achieve exascale by 2030. Their current strategy appears to be broad and capacity-oriented, with forty petascale clusters spread across India for academic, industrial, and domain-specific research. They have a comprehensive, if generic, strategy that involves international collaboration in some regards, reliance on open-source software to fill out their HPC environment story, and home-grown hardware and infrastructure:\n\nIndia's ambitious strategy towards exascale in 2030. This slide has it all, from home-grown CPUs and networks to five systems deployed in six years.\n\nI was surprised to hear about their ambitions to deploy their own CPUs and interconnect though. India is pursuing both ARM and RISC-V for their own CPUs for a future 200 PF system, and they're already deploying their \"InfiniBand-like\" interconnect, TRINETRA, which uses funny NICs with 6x100G ports or 10x200G ports rather than fewer, faster serdes. I didn't hear mention of their AI acceleration plans, but rolling their own commercialized CPU and interconnect in itself is a lot to bite off. Given that India is the world's fastest-growing economy though, these plans to go from 20 PF in 2025 to 1 EF in 2030 may not be that far-fetched. Perhaps the Indian national strategy will become clearer during the inaugural Supercomputing India 2025 conference this December. The Korea Institute of Science and Technology Information also took the stage to describe their next national supercomputer, KISTI-6, which was first announced in May 2025. 
It will be a 588 PF Cray EX254n system with 2,084 nodes of GH200, similar to Alps and Isambard-AI. This is quite a step up from its predecessor, which was an air-cooled KNL system, but it's unlikely it will unseat Fugaku; the 588 PF number cited appears to be the sum of 2,084 GH200 nodes, 800 Turin CPU nodes, and 20 H200 SXM5 nodes. The HPL score of its GH200 nodes will place it below Alps and somewhere around 350 PF, likely joining a flood of multi-hundred-petaflops GH200 systems that will appear between now and ISC26. Singapore (NSCC) and Taiwan (NCHC) both presented their national programs as well, but they appear to be much more nascent, and the size of their HPC infrastructure was presented as aggregate capacity, not capability. Their strategies involve partnership with Japan or Korea, but both had specific carveouts for both sovereign AI and quantum computing. Interestingly, their use cases for AI both had a strong story about training models that understood the diversity of languages and dialects represented in their nations. For example, it is not unusual for people to switch languages or dialects mid-sentence in Singapore, and the big Western models aren't designed for that reality. Similarly, Taiwan has 16 indigenous tribes with 42 dialects. It seemed like enabling LLMs that reflect the breadth of languages used in Singapore and Taiwan has become the responsibility of these nations' respective national supercomputing efforts. That said, that noble mission didn't seem to be matched with substantial training infrastructure; these localized models will be relying on a couple hundred GPUs here and there, wedged into existing HPC centers. Thus, these sovereign models are probably going to be fine-tuned variants of open models, aligning with my earlier observation that these smaller nations will be innovating around the edges of HPC and AI. What was missing? 
Although Vietnam, Thailand, Malaysia, and other Asian nations have strong HPC programs centered around industrial uses, they were not represented in ISC's HPC Around the World track. Also absent was any meaningful discussion around cloud; while everyone had a throwaway line about cloud in their presentations, the fact that the only big clouds in Asia are Chinese and American probably makes it unappealing to integrate them into the core of these nations' national HPC strategies. Speaking from experience, this is quite different from the attitudes of commercial HPC users across Asia who are all too happy to let someone else run HPC datacenters for them.The Middle EastAlthough KAUST has been a world-class HPC center in the Middle East for the past fifteen years, AI seems to be where the majority of new investment into HPC is going.In describing new efforts in Saudi Arabia, Prof. David Keyes casually mentioned the Saudi HUMAIN effort, which will build 500 MW of datacenter capacity and 18,000 GB300 GPUs, after describing the Shaheen-3 GH200 upgrade that \"might (barely)\" put it back in the Top20 by SC'25. Similarly, Dr. Horst Simon walked through a few of Abu Dhabi's university clusters (each having dozens of GPU nodes) after skating through an announcement that a 5 GW AI campus was also being built in Abu Dhabi. The gap between investment in AI and investment in HPC was striking.I also had a brief conversation with someone from one of the major Abu Dhabi universities, and I was very surprised to find that I was talking to a real AI practitioner--not an HPC person moonlighting in AI--who spoke at the same depth as the customers with whom I work in my day job. 
The nature of his work made it clear to me that, despite his university not having a Top500 system, he was familiar with running training and inference at scales and with sophistication that is far beyond the experience of most ISC attendees. These interactions led me to the conclusion that the Middle East's approach to \"sovereign AI\" is quite different from Europe's. Rather than building HPC systems with GPUs, letting HPC centers operate them, and calling them sovereign AI platforms, nations like Saudi Arabia and UAE are keeping HPC and AI separate. Like in the US, they are going straight to hyperscale with AI, and they have no preconceived notion that anything resembling a supercomputer must be hosted at a supercomputer center. Of course, only nations like Saudi Arabia and UAE can afford to do this, because they have trillion-dollar sovereign wealth funds to invest in massive infrastructure buildout that isn't contingent on public consensus or the latest election cycle. Just as UAE's Core42 can build a 5 GW datacenter campus with little oversight, these nations can easily mis-step and invest a ton of money in an AI technology that turns out to be a flop. In the end, it seems like these Middle Eastern nations are willing to take bigger risks in how they build out their sovereign AI infrastructure, because they are largely starting from a blank sheet of paper. They aren't limiting themselves to 20 MW supercomputers like the HPC world has. All things being equal, this might turn out to be an advantage over other nations who are more hesitant to deviate from the tried-and-true course of buying a Cray or a Bull, sticking some GPUs in it, and calling it AI. If these Middle Eastern nations do everything right, they stand to get a lot further and move a lot faster in sovereign AI than Europe, and it'll be fascinating to see how quickly they catch up with the sort of frontier AI research being done in private industry. 
But, as with the US AI industry, it doesn't seem like these AI practitioners are going to be attending ISC in the same way European sovereign AI folks do; the roads of HPC and AI seem to run parallel without intersecting in the Middle East.\n\nExhibitors\n\nISC had a record number of exhibitors this year, and as usual, I tried to set aside at least an hour or two to walk the floor and see what technologies are on the horizon. This year, though, the exhibit hall was not a great representation of the rest of the conference. Everyone I talked to about the exhibit said one of two things: (1) There are a LOT of quantum companies. (2) A lot of big companies were noticeably absent. It also didn't feel like the biggest exhibit ever, partially because of #2, and partially because many of the exhibitors--one in five--were exhibiting for the first time this year. This meant a lot of the booths were small and barebones, and many of them belonged to either companies at the periphery of HPC (such as companies that make dripless couplers for liquid cooling) or small startups who just had a desk, a few pens, and some brochures. On the first point, it was true--quantum computing was well represented, with 22% of exhibitors identifying as being involved in the field in some form. In fact, quantum felt over-represented, since the ISC technical program certainly didn't have such a large fraction of talks on quantum computing topics. I didn't have time to actually talk with any of these quantum companies though, so I wasn't able to get a sense of why the startup ecosystem around quantum computing was so rich in Europe as compared to the US. While there was an abundance of quantum this year, a number of the big HPC and HPC-adjacent companies were noticeably absent: Amazon, Azure, and Google did not have booths despite having booths last year. 
Amazon and Google still sponsored the conference at the lowest tier (bronze) though, while Microsoft did not sponsor at all. Intel had neither booth nor sponsorship despite having the #3 system on Top500. I don't think they held a party this year, either. AMD didn't have a booth, but they sponsored (and gave the opening keynote!). WEKA neither had a booth nor sponsored the conference this year, although they were the leading sponsor of the Student Cluster Competition. Competitors DDN, VAST, Quobyte, and BeeGFS all had booths, but only VAST sponsored. Curiously, Pure and Scality, which do not have big footholds in leadership HPC, had both booths and sponsorships. These companies who chose not to have a booth still sent people to the conference and were conducting meetings as usual, though. This suggests that there's something amiss with how large companies perceive the return on investment of having a booth at ISC. I don't have any insider knowledge here, but I was surprised by the pullback since ISC has historically been very good at incentivizing attendees to walk through the expo hall by putting it between the technical sessions and the food breaks. As I walked the exhibit floor, I found that prominent booths spanned the whole HPC stack: software, system integrators, component makers (CPUs, GPUs, HBM and DDR, and SSD and HDD), and datacenter infrastructure were all exhibiting. The most eye-catching booths were those with big iron on display: HPE/Cray had a full EX4000 cabinet and CDU on display, and there were a few Eviden BullSequana nodes floating around.\n\nThe Cray EX4000 cabinet (right) and its CDU (left) on display at the ISC'25 exhibition hall. One of the most eye-catching displays, even though they've been on display at ISC and SC for a few years now.\n\nSadly, though, there were no full BullSequana X3000 racks on display. 
I've still never seen one in real life. Infrastructure companies like Motivair (who manufactures the CDUs for Cray EX) and Rittal (which I know as a company that manufactures racks) also had big liquid-liquid heat exchangers on display with shiny steel piping. Here's a smaller version of the Cray EX CDU that Motivair was displaying:\n\nA close-up view of a smaller liquid-liquid heat exchanger CDU on display at the Motivair booth right next to HPE's. Strangely, the mechanics of these systems dovetail with what I've learned as a part of my other hobby outside of HPC, which is operating a multi-family residential high-rise.\n\nI got to chatting with some good folks at Motivair, and I learned that the 1.2 MW variant that is used with Cray EX has a 4\" connection--the same size as the water main in my co-op. Since I recently helped with the replacement of my building's water main, this led me down a rabbit hole where I realized that the flow rate for this CDU is roughly the same as my apartment building's too, which is to say, a single Cray CDU moves as much fluid as a 55-unit apartment building. Incidentally, a single Cray EX cabinet supports roughly the same electrical capacity as my 55-unit building too--I am in the process of replacing our 1,200 A service panel, which comes out to about the same 400 kVA as a fully loaded EX. Aside from the Cray cabinets and CDUs, which are no longer new to ISC, I couldn't put my finger on any particularly outstanding booths this year though. The exhibit felt like a sea of smaller companies, none of which really grabbed me. This isn't to say that big vendors were wholly absent though. Despite not having booths, all three big cloud providers threw parties during the week: AWS and NVIDIA teamed up on a big party with over a thousand registrants, while Google and Microsoft held smaller parties towards the end of the week. 
HPE also threw a lovely event that was off the beaten path along the Elbe, resulting in a less-crowded affair that made it easy to catch up with old friends.I may be reading too much into this year's exhibit, but it felt like ISC might be transforming into an event for smaller companies to gain visibility in the HPC market, while larger companies apply their pennies only in the parts of the conference with the highest return. Whether a company chose to have a booth, sponsor the conference, and/or throw a party seemed to defy a consistent pattern though, so perhaps other factors were at play this year.Cloud, or lack thereofBecause I work for a large cloud service provider, I attended as many cloud HPC sessions as I could, and frankly, I was disappointed. The clear message I got by the end of the week was that Europe--or perhaps just ISC--doesn't really care about the cloud. This is quite different from the view in the US, where the emergence of massive AI supercomputers has begun to shift opinions to the point where the successor to the Frontier supercomputer at OLCF might wind up in the cloud. I suppose cloud is a lot less attractive outside of the US, since all the major cloud providers are US corporations, but the way in which cloud topics were incorporated into the ISC program this year sometimes felt like a box-checking exercise.For example, I attended the BOF on \"Towards a Strategy for Future Research Infrastructures\" which I expected to be a place where we discussed the best ways to integrate traditional HPC with stateful services and other workflow components. While cloud was mentioned by just about every panelist, it was almost always in a throwaway statement, lumped in with \"the edge\" or cited as a vague benefit to \"new workflows and interactive analysis\" with no further detail. One speaker even cited egress fees as a big challenge which, to me, means they haven't actually talked to a cloud provider in the last five to ten years. 
If egress fees are what stop you from using the cloud, you're talking to the wrong account team. I get it though; there are times when cloud doesn't offer enough obvious benefit for HPC to justify the effort required to figure it out. In those cases, it's incumbent on cloud providers to provide a better story. But I was also disappointed by the invited session called \"Bridging the Gap: HPC in the Cloud and Cloud Technologies in HPC,\" which I hoped would be the place where cloud providers could make this case. Instead, only two of the three CSPs were even invited to speak, and it was clear that the speakers did not all get the same assignment with their invitations. Granted, the CSP for whom I work was the one not invited (so I came in a little biased), but I was surprised by how differently each speaker used their time. Dr. Maxime Martinasso from CSCS gave a talk from the perspective of trying to add cloud-like capabilities to a supercomputer, which is a recurring pattern across a number of sites (including many in the US DOE) and projects. He explained the way they're creating an infrastructure-as-code domain-specific language that sits on top of Alps, their Cray EX system, to give users the ability to bring their own software stacks (all the way down through Slurm) to the supercomputer. It was clearly a ton of work on CSCS's part to develop this capability, and yet the talk's \"future work\" slide contained a bunch of features which those of us in the cloud would consider \"P0\"--priority zero, or essential for a minimum viable product. By the end of Martinasso's talk, I realized that CSCS's perspective is that, unlike commercial cloud, these cloudy features aren't P0; having a supercomputer on the floor is. He made the case that CSCS has a need to explore diverse computing architectures and accelerators (as evidenced by the five different node types in Alps!), and putting them all on a single RDMA fabric isn't something any cloud provider will do. 
As a result, adding any new cloud-like capability to the heterogeneous supercomputer is just gravy, and the fact that true cloud is more \"cloudy\" than Alps is irrelevant since the cloud will never support the intra-fabric heterogeneity that Alps does. The other two speakers represented big cloud providers, and their talks had a bit more product pitch in them. One speaker talked through the challenges the cloud is facing in trying to fold supercomputing principles into existing cloud infrastructure (a theme I repeated in my talk later in the week) before talking about specific products that have arisen from that. It touched on some interesting technologies that the HPC world hasn't yet adopted (like optical circuit switching--super cool stuff for programmable fabrics), and I learned a few things about how that provider might bring new HPC capabilities to the table for specific workloads. The other speaker, though, presented a textbook pitch deck. I've given almost the exact same presentation, down to showing the same sort of customer stories and product comparison tables, during customer briefings. Execs in the audience would eat it up while engineers' eyes would glaze over, and having to do that song and dance is partly why I didn't make it as a product manager. I was incredulous that such a presentation was an invited talk at one of the most prestigious HPC conferences in the world. This is not to say I was mad at the speaker. He did exactly what one would expect from a leader in the sales side of an organization, hitting all the notes you'd want in a textbook pitch aimed at the C-suite. Rather, I was disappointed by the choice by the session organizers; when you invite someone whose job is driving business at one of the largest cloud providers to speak, you should fully expect a broad and salesy presentation. 
I don't think it's a stretch to say that most ISC attendees aren't looking for these sorts of high-level talks designed for enterprise decision-makers; they want insight and technical depth. Was I miffed that a competitor got to give a twenty-minute sales pitch during a session at which I wasn't invited to speak? Absolutely. And do I think I could've given a talk in which even the most ardent cloud-hater would find something interesting? Probably. But since that didn't happen, the best I can do is complain about it on the Internet and hope that next year's program committee puts more care into organizing an invited speaker session on cloud and HPC. Thankfully, I was given the opportunity to talk a little about my work at the SuperCompCloud workshop on Friday. That workshop felt like what the \"Bridging the Gap\" invited session should've been, and there were roughly equal parts of presentations on adding cloud-like features to their HPC infrastructure and adding HPC-like features to cloud infrastructure. From my perspective, the workshop was great; I got to see how traditional HPC centers are adopting cloud practices into their operations, and I could explain how we overcame some of the challenges they're facing in Azure. But to my point at the outset of this section--that Europe doesn't really care about the cloud--the majority of speakers at SuperCompCloud were American.\n\nParting thoughts\n\nAs I said at the outset, there were way more sessions that I missed than I attended. In addition, a lot of the big headlines of the week were coincident with, not made at, the conference. A few noteworthy announcements during the week that I won't go into detail about include: £750M was awarded to EPCC to deploy what sounds like the UK's first exascale system. This announcement's overlap with ISC was a total coincidence, so EPCC didn't have many details to share. The Ultra Ethernet Consortium announced the long-awaited version 1 of its spec. 
I'm not sure how relevant this is to HPC yet, but given how many networking talks compared themselves against InfiniBand, I think there's a lot of appetite for a high-performance, non-proprietary alternative.\n\nSadly, HPC_Guru announced his retirement mid-week as well. It's not clear this was deliberately timed with ISC, but it was acknowledged on the big stage during the ISC closing statements and resulted in a lot of recognition online. I credit HPC_Guru, whoever he is, with a lot of the success I've enjoyed in my career, as he amplified my voice as far back as 2009 when I first started on Twitter. Maybe with his retirement, I should try to do for others what he did for me.\n\nAnd along the lines of reflecting back over the years, this was ISC's 40th anniversary, and the organizers had a few wonderful features to commemorate the milestone. Addison Snell organized a panel where a variety of attendees got to discuss the impact that the conference has had on them over the past 40 years, and I was delighted to find that I was not the only person to reflect back on how ISC has shaped my career. As critical as I can be of specific speakers and sessions when I write up these notes, I do hope it goes without saying that I wouldn't bother doing all this for a conference that wasn't deeply engaging and rewarding to be a part of.\n\nGoing back to this year's theme of connecting the dots, I think it's apt. Some ways in which HPC connected dots at ISC this year were obvious; the conference brought together people with a common interest in high-performance computing from across 54 countries and seven continents this year. 
But this year's conference also made it clear that the role of HPC going forward may be connecting the dots between different technologies being developed for AI, cloud, enterprise, and other markets and the problems in scientific computing that need to be solved.\n\nThe latest and greatest Blackwell GPUs barely registered at ISC this year, and the HPC community seems OK with that now. Instead of the focus being on the absolute top-end in high-performance accelerators, HPC's focus was on connecting the dots between last generation's GPUs and today's grand challenges in science. Instead of showcasing the newest innovations in secure computing in the cloud, HPC's focus was on connecting the dots between a few relevant pieces of zero trust and big-iron on-prem supercomputers.\n\nHPC has always been about figuring out ways to use stuff invented for someone else to solve scientific challenges--connecting the dots. Beowulf clusters started that way, GPGPU computing started that way, and emulating DGEMMs (and other primitives) on AI accelerators will probably follow the same pattern. But different nations are drawing different lines between the dots; while the US might draw a shorter line between commercial cloud and HPC at scale, Europe is drawing shorter lines between HPC for scientific computing and HPC for sovereign AI.\n\nIf we accept that connecting the dots may be where the HPC community can make the most impact, then it's fitting that ISC chose to carry forward the theme of \"connecting the dots\" into ISC'26. This break from the tradition of introducing a new tagline each year suggests that, at times, optimizing what we already have can take us further than pursuing something completely new. After 40 years, ISC remains not only a showcase of innovation, but a reflection of how the HPC community (and its role in the technology landscape) is evolving. 
If we continue to embrace this theme of stitching together breakthroughs instead of spotlighting them individually, the HPC community is likely to be more relevant than ever alongside--not in spite of--the overwhelming momentum of hyperscale and AI.",
            "content_html": "<p>I had the pleasure of attending the 40th annual ISC High Performance conference this month in Hamburg, Germany. It was a delightful way to take the pulse of the high-performance computing community and hear what the top minds in the field are thinking about.</p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure><figcaption class=\"image-caption\">The main foyer of Congress Center Hamburg, and the view that greeted me on the first morning of ISC'25. </figcaption></figure></div><p>The conference felt a little quieter than usual this year, and there didn't seem to be as many big ideas and bold claims as in years past. There was <a href=\"https://www.theregister.com/2025/06/10/jupiter_europes_top_super/\">a new Top 10 system announced</a>, but it was built using previous-generation Hopper GPUs. There were a <a href=\"https://isc-hpc.com/the-isc-2025-exhibition-sets-new-records/\">record number of exhibitors</a>, but many of the big ones (Intel, AMD; the big three cloud providers) were all absent. And while there were some exciting new technologies (like <a href=\"https://www.tomshardware.com/pc-components/gpus/amd-announces-mi350x-and-mi355x-ai-gpus-claims-up-to-4x-generational-gain-up-to-35x-faster-inference-performance\">AMD MI350-series GPUs</a> and <a href=\"https://ultraethernet.org/ultra-ethernet-consortium-uec-launches-specification-1-0-transforming-ethernet-for-ai-and-hpc-at-scale/\">Ultra Ethernet v1.0</a>) debuting during the week, they actually debuted elsewhere and were simply referenced throughout the week's talks.</p><p>This year's ISC really felt like the place where the big news of the industry was being repeated in the context of scientific computing instead of being stated for the first time. 
And maybe this is the future of HPC conferences: rather than being where new technology is announced, perhaps ISC will become where the scientific community tries to figure out how they can use others' technology to solve problems. That idea--figuring out how to make use of whatever the AI industry is releasing--was certainly pervasive throughout the ISC program this year. The conference's theme of \"connecting the dots\" felt very appropriate as a result; rather than defining new dots, the conference was all about trying to make sense of the dots that have already been drawn.</p><p>I took plenty of notes to try to keep track of everything that was being discussed, and as has become tradition, I've tried to summarize some of the key themes in this post.</p><h2 style=\"text-align: left;\">Table of contents</h2><ul><li><a href=\"https://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#zettascale\">Zettascale</a></li><li><a href=\"https://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#ozaki-ozaki-ozaki\">Ozaki, Ozaki, Ozaki</a></li><li><a href=\"https://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#top500\">Top500</a><ul><li><a href=\"https://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#jupiter\">JUPITER</a></li><li><a href=\"https://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#hpcai-system-intersection\">HPC-AI system intersection</a></li><li><a href=\"https://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#other-new-entrants\">Other new entrants</a></li></ul></li><li><a href=\"https://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#hpc-around-the-world\">HPC around the world</a><ul><li><a href=\"https://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#hpc-in-china\">HPC in China</a></li><li><a href=\"https://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#elsewhere-in-asia\">Elsewhere in Asia</a></li><li><a 
href=\"https://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#the-middle-east\">The Middle East</a></li></ul></li><li><a href=\"https://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#exhibitors\">Exhibitors</a></li><li><a href=\"https://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#cloud-or-lack-thereof\">Cloud, or lack thereof</a></li><li><a href=\"https://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#parting-thoughts\">Parting thoughts</a></li></ul><h2 id=\"zettascale\">Zettascale</h2><p>Now that exascale is squarely in the rear-view mirror of HPC, an increasing number of high-profile speakers began pushing on zettascale as the next major milestone. Like the early days of exascale, most of the discourse was less about what can be achieved with zettascale and more about the technology challenges that need to be surmounted for HPC to continue moving forward. And to that end, using zettascale to justify tackling big hardware and software challenges wasn't a bad thing, but it felt like every talk about zettascale this year was still more fanciful than anything else.</p><p>The opening keynote, \"HPC and AI - A Path Towards Sustainable Innovation,\" was delivered by a duo of CTOs: Mark Papermaster (of AMD) and Scott Atchley (of Oak Ridge Leadership Computing Facility). It was a textbook keynote: it had inspiring plots going up and to the right that showed huge potential! It had scary linear extrapolations showing that staying the course won't do! It had amazing science results enabled by big iron! It even had a surprise product debut in MI355X! ChatGPT couldn't have come up with a better structure for a keynote presentation. 
But as is my wont, I listened to the talk with a little skepticism and found myself raising an eyebrow a few times.</p><p>A part of Papermaster's presentation involved an extrapolation to zettascale by 2035 and claimed that HPC is approaching an \"energy wall:\"</p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure><figcaption class=\"image-caption\">Extrapolating ten years on a semilog plot is a great way to cause alarm in people who don't pay close attention to axes.</figcaption></figure></div><p>He specifically said that we'd need 1 GW per supercomputer to reach zettascale by 2035 on the current trajectory. He then used this to motivate \"holistic co-design\" as the only way to reach zettascale, and he went on to talk about all the same things we heard about leading up to exascale: increase locality and integration to reduce power and increase performance.</p><p>While I agree that we should aspire to do better than a gigawatt datacenter, this notion that there is an \"energy wall\" that stands between us and zettascale is a bit farcical; there's nothing special about a 1 GW zettascale supercomputer, just like there was nothing special about 20 MW for exascale. You might argue that building a supercomputer that consumes all the power of a nuclear reactor might be fundamentally more difficult than one that consumes only 20 MW, and you'd be right--which is why the first gigawatt supercomputers probably aren't going to look like the supercomputers of today.</p><p>Papermaster's \"energy wall\" slide reminded me of <a href=\"https://fee.org/articles/the-great-horse-manure-crisis-of-1894/\">the great horse manure crisis of 1894</a>, where people extrapolated from today using an evolutionary, not revolutionary, trajectory. If building a single gigawatt supercomputer is inconceivable, then build four 250 MW supercomputers and put a really fast network between them to support a single, synchronous job. 
The AI industry is already headed down this road; <a href=\"https://glennklockwood.com/garden/multicluster-training\">Google, Microsoft, and OpenAI have already talked about how they synchronously train across multiple supercomputers</a>, and Microsoft announced their <a href=\"https://view.officeapps.live.com/op/view.aspx?src=https%3A%2F%2Fmediusdownload.event.microsoft.com%2Ftranscripts%2FD6K5%2FKEY020%2FKEY020.docx%3Fsv%3D2018-03-28%26sr%3Db%26sig%3D0gs30Jf82r%252BresqqpGIGKRSOtKrvicgbpqh5Tdkigpg%253D%26se%3D2025-06-24T18%253A46%253A13Z%26sp%3Dr&amp;wdOrigin=BROWSELINK\">400 Tb/s \"AI WAN\" for this last month</a> as a means to enabling wide-area training.</p><p>Granted, it's unlikely that the HPC community will be building massive, distributed supercomputers the way hyperscale is. But I was disappointed that the keynote only went as far as saying \"a gigawatt supercomputer is crazy, so we need codesign at the node/rack scale.\" Codesign to reach zettascale will probably require a whole new approach that, for example, accounts for algorithms that <a href=\"https://github.com/NVIDIA/nccl/pull/1659\">synchronize communication across multiple datacenters</a> and power plants. The infrastructure for that is already forming, with the US developing its Integrated Research Infrastructure (IRI) and Europe shaping up to have over a dozen AI factories. Zettascale by 2035 may very well exist for the scientific computing community, but it'll probably look a lot more like hyperscale zettascale rather than a single massive building. A single machine plugged into a gigawatt nuclear reactor only happens if business-as-usual is extrapolated out another ten years as Papermaster did, and the codesign required to achieve that isn't very meaningful.</p><p>Prof. Satoshi Matsuoka also gave a talk on the big stage about <a href=\"https://glennklockwood.com/garden/systems/FugakuNEXT\">Fugaku-NEXT</a>, which Japan has branded as a zettascale system. 
His vision, which will be realized before 2030, aims to deploy a single, 40 MW supercomputer (much like <a href=\"https://www.glennklockwood.com/garden/systems/Fugaku\">Fugaku</a> was) where:</p><ul><li>10x-20x speedup comes from hardware improvements</li><li>2x-8x speedup comes from mixed precision or emulation (more on this below)</li><li>10x-25x speedup comes from surrogate models or physics-informed neural networks</li></ul><p>The net result is a 200x-4000x speedup over Fugaku. His rationale is that this will result in a system that is effectively equivalent to somewhere between 88 EF and 1.7 ZF FP64. It's not literally doing that many calculations per second, but the science outcomes are equivalent to a brute-force approach using a much larger system.</p><p>I thought this approach to reaching zettascale was much more realistic than the Papermaster approach, but it does require the scientific computing community to redefine its metrics of success. If HPL was a bad benchmark for exascale, it is irrelevant to zettascale since it's unlikely that anyone will ever run HPL on a zettascale system. At best, we'll probably see something like <a href=\"https://hpl-mxp.org\">HPL-MxP</a> that captures the 10x-20x hardware speedup and the 2x-8x mixed-precision or emulated FP64 speedup to reach hundreds of exaflops, but the 10x-25x from surrogate models will be domain-specific and defy simplistic ranking. If I had to guess, the first zettascale systems will be benchmarked through Gordon Bell prize papers that say things like \"simulating this result using conventional FP64 would have required over 1 ZF for 24 hours.\"</p><h2 id=\"ozaki-ozaki-ozaki\">Ozaki, Ozaki, Ozaki</h2><p>Although Prof. Matsuoka invoked the 2x-8x speedup from mixed precision or emulation when claiming <a href=\"https://www.glennklockwood.com/garden/systems/FugakuNEXT\">Fugaku-NEXT</a> would be zettascale, he was far from the only speaker to talk about mixed precision and emulation. 
In fact, it seemed like everyone wanted to talk about emulating FP64, specifically using NVIDIA's low-precision tensor cores and the <a href=\"https://doi.org/10.1007/s11075-011-9478-1\">Ozaki scheme</a> (or its derivatives). By the end of the week, I was absolutely sick of hearing about Ozaki.</p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure></figure></div><p>For the unindoctrinated, this Ozaki scheme (and similar methods with less-catchy names) is a way to emulate matrix-matrix multiplications at high precision using low-precision matrix operations. It's become so hot because, despite requiring more arithmetic operations than a DGEMM implemented using WMMA/MFMA instructions, it can crank out a ton of FP64-equivalent operations per unit time. This is a result of the ridiculously nonlinear increases in throughput of low-precision tensor/matrix cores on modern GPUs; for example, Blackwell GPUs can perform over 100x more 8-bit ops than 64-bit ops despite 8-bit operands being only 8x smaller. As a result, you can burn a ton of 8-bit ops to emulate a single 64-bit matrix operation and still realize a significant net speedup over hardware-native FP64. Matsuoka presented the following slide to illustrate that:</p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure><figcaption class=\"image-caption\">Dr. Uchino's estimates of how many FP64 FLOPS one can emulate using INT8 as presented by Satoshi Matsuoka.</figcaption></figure></div><p>Emulation offers a way for scientific apps that need high-precision arithmetic to directly use AI-optimized accelerators that lack FP64 in hardware, so it's worth talking about at conferences like ISC. 
But it seems like <em>everyone</em> wanted to name-drop Ozaki, and the actual discussion around emulation was generally a rehash of content presented earlier in the year at conferences like <a href=\"https://blog.glennklockwood.com/2025/03/gtc-2025-recap.html\">GTC25</a>.</p><p>While hearing about FP64 emulation and Ozaki schemes got tiring throughout the week, I had to remind myself that I hadn't even heard about Ozaki before September 2024 at the Smoky Mountains Conference. The fact that the Ozaki scheme went from relative algorithmic obscurity to being the star of the show in nine months is either a reflection of its incredible importance in scientific computing or a testament to the reach of NVIDIA's marketing.</p><p>Cynically, I'll bet that NVIDIA is probably doing everything it can to make sure the world knows about the Ozaki scheme, and ISC was a part of that. When the datasheets for Rubin GPUs are released, I'll bet the performance table has a row claiming a bazillion FP64 FLOPS, and there will be a tiny footnote that clarifies they're citing emulated FP64 precision. They did it with structured sparsity, and I'm sure they'll do it for emulated DGEMM.</p><p>Although the Ozaki scheme is perhaps over-hyped considering how narrow its applicability is to the broad range of compute primitives used in scientific computing, I do anticipate that it is the tip of the iceberg. If 2025 was the year of the Ozaki scheme, 2026 may be the year of the emulated FP64 version of FFTs, sparse solvers, stencils, or other key algorithms. We're seeing signs of that already; David Keyes and Hatem Ltaief both presented material at ISC on using mixed-precision matrix operations for other scientific problems, and I mentioned <a href=\"https://blog.glennklockwood.com/2025/03/gtc-2025-recap.html#for-science\">their work in my earlier GTC25 blog</a>. 
I'm not sure \"the Keyes scheme\" or \"the Ltaief scheme\" is as catchy as \"the Ozaki scheme,\" but I expect to hear more about these other emulation techniques before ISC26.</p><h2 id=\"top500\">Top500</h2><p>On the topic of matrix-matrix multiplication, I can't get too much farther without talking about the Top500 list released at ISC. Although there was no new #1 system, Europe's first exascale system, JUPITER, made its sub-exascale debut. There were also a number of new entries in Top50, and surprisingly, many of them came from companies that offer GPUs-as-a-Service for AI training rather than the usual public-sector sites delivering cycles for scientific research. However, all the new entries were still using previous-generation Hopper GPUs despite huge Blackwell systems coming online, exposing a perceptible lag between the state of the art in supercomputers for AI and traditional HPC.</p><p>As with last year, I felt a growing tension between what the Top500 list brings to the discussion and where the large-scale supercomputing industry is headed. As I wrote earlier, mixed precision and emulated FP64 were hot topics in the technical program, but the emphasis of the Top500 session was still squarely on bulk-synchronous FP64 performance. HPL-MxP awards were handed out, but they all wound up in the hands of systems that were also at the top of the regular HPL list. Nobody is submitting HPL-MxP-only scores, and there was no meaningful discussion about the role that the Ozaki scheme will play going forward in Top500's future.</p><p>Opining about the long-term future of the Top500 list is a whole separate blog post though, so I'll focus more on what was covered at this year's session.</p><h3 id=\"jupiter\">JUPITER</h3><p>JUPITER was the only new entrant into the Top 10, and it posted at #4 with an average 793 PF over a hundred-minute run. 
Though it hasn't broken the 1 EF barrier yet, JUPITER is noteworthy for a few reasons:</p><ul><li>It is expected to be Europe's first exascale system. Given this HPL run <a href=\"https://bsky.app/profile/andih.bsky.social/post/3lrvrguvtzc2b\">used only 79% of the Booster Module's 5,884 GH200 nodes</a>, some basic extrapolation puts the full-system run just a hair above 1 EF. Jülich will either have to run with 100% node availability or get a few extra nodes to exceed 1 EF though.</li><li>JUPITER is also now the biggest NVIDIA-based supercomputer on Top500, pushing Microsoft's H100 SXM5 system (Eagle) down to #5. JUPITER is also Eviden's biggest system and a strong affirmation that Europe isn't dependent on HPE/Cray to deliver on-prem systems of this scale.</li></ul><p>JUPITER was also installed into a modular datacenter, an approach that is emerging as a preferred method for rapidly deploying large GPU systems in Europe. This setup allowed Jülich to place shipping container-like modules on a concrete foundation in just a few months. However, because the datacenter is form-fit to the JUPITER system without much extra space, it's impossible to take a glamor shot of the entire machine from far away. As a result, most photos of JUPITER show only the datacenter modules that wrap the supercomputer racks. For example, Prof. Thomas Lippert shared this photo of JUPITER during his presentation:</p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure><figcaption class=\"image-caption\">JUPITER's modular datacenter as seen from a drone flying overhead.</figcaption></figure></div><p>As Lippert was describing JUPITER, I couldn't help but compare it to the AI supercomputers I support at my day job. Like JUPITER, our supercomputers (like Eagle) aren't very photogenic because they're crammed into form-fitted buildings, and they are best photographed from the sky rather than the ground. 
For example, here's a photo of one of Microsoft's big GB200 supercomputers that I presented later in the week:</p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure><figcaption class=\"image-caption\">A slide showing one of Microsoft's big GB200 supercomputers that I presented at the SuperCompCloud workshop later in the week. The big two-story building in the center houses GPUs, and the long white building on the right houses storage and CPU-only nodes.</figcaption></figure></div><p>JUPITER may be the first exascale system listed on Top500 that doesn't have fancy rack graphics, but I don't think it will be the last.</p><p>I also found myself wondering if these modular datacenters are trading short-term upsides for long-term downsides. While they accelerate deployment time for one-off supercomputers, it wasn't clear to me if these modular structures are reusable. Does the entire datacenter retire along with JUPITER after 5-7 years?</p><p>Hyperscalers use modular datacenters too, but the modularity is more coarse-grained to support a wider variety of systems over multiple decades. They're also physically more capacious, allowing them to deploy more CDUs and transformers per rack or row to retrofit them for whatever power and cooling demands evolve into over the full depreciation life of the datacenter building.</p><h3 id=\"hpcai-system-intersection\">HPC-AI system intersection</h3><p>As with last year, Erich Strohmeier did a walkthrough of Top500 highlights, and he argued that \"hyperscale\" is defined as anything bigger than 50 MW, and therefore the Top500 list is hyperscale. 
It wasn't clear what value there was in trying to tie the Top500 list to hyperscale in this way, but there were a few ways in which Top500 is beginning to intersect with hyperscale AI.</p><p>Foremost is the way in which some exascale systems have been appearing on the list: they first appear after HPL is run on a big but partially deployed machine, then six months later, the full-system run is listed. Aurora and JUPITER both follow this pattern. What's not obvious is that many massive AI supercomputers also do something like this; for example, the Eagle system's 561 PF run was analogous to <a href=\"https://top500.org/lists/top500/2023/11/\">Aurora's initial 585 PF run</a> or JUPITER's 793 PF run. The difference is that systems like Eagle typically enter production training after that first big tranche of GPUs is online, so there is never an opportunity to run HPL as more of the system powers up. Instead, the production training job simply expands to consume all the new GPUs as new tranches come online.</p><p>This iteration of the Top500 list also saw a number of bona fide commercial AI training clusters from smaller GPU-as-a-Service and \"AI factory\" providers post results, giving the public a view of what these systems actually look like:</p><ul><li>Nebius listed <a href=\"https://top500.org/system/180366/\">ISEG2</a> at #13 with a 624-node, 202 PF H200 SXM cluster, following their 2023 Top500 debut with a 190-node, 46 PF H100 SXM cluster. Nebius was spun out of Yandex, the Russian tech conglomerate.</li><li>Northern Data Group debuted <a href=\"https://top500.org/system/180378/\">Njoerd</a> at #26 with a 244-node H100 SXM cluster. Northern Data Group started out as a German bitcoin mining company.</li><li>FPT debuted at #36 with a <a href=\"https://top500.org/system/180399/\">127-node H200 SXM cluster</a> and #38 with a <a href=\"https://top500.org/system/180387/\">127-node H100 SXM cluster</a>. 
FPT is a Vietnamese technology conglomerate.</li></ul><p>It's notable that none of these systems resemble the sovereign AI systems or EuroHPC AI Factories cropping up in Europe, which are attached to traditional HPC centers and built on familiar HPC platforms like Cray EX or BullSequana. Rather, they're essentially NVIDIA reference architectures that resemble DGX SuperPods but are stamped out by companies like Supermicro, Gigabyte, and ASUS.</p><p>While it's nice of these GPU-as-a-Service companies to participate in the Top500 list, I did not see anyone from these companies in the technical program in any other way. And I did not see anyone from the bigger GPU-as-a-Service providers (CoreWeave, Crusoe, Lambda, etc) contributing either. Thus, while these companies are participating in Top500, it doesn't seem like they're genuinely interested in being a part of the HPC community.</p><h3 id=\"other-new-entrants\">Other new entrants</h3><p>If you take a step back and look at the ten largest systems that made their debut at ISC'25, they broadly divide into two categories. 
Here's the list:</p><div><table style=\"border-collapse: collapse; font-family: sans-serif; font-size: 0.9em; width: 100%;\"><thead style=\"background-color: #f2f2f2;\"><tr><th style=\"border: 1px solid rgb(204, 204, 204); padding: 8px; text-align: left;\">Rank</th><th style=\"border: 1px solid rgb(204, 204, 204); padding: 8px; text-align: left;\">System</th><th style=\"border: 1px solid rgb(204, 204, 204); padding: 8px; text-align: left;\">Platform</th><th style=\"border: 1px solid rgb(204, 204, 204); padding: 8px; text-align: left;\">Site</th></tr></thead><tbody><tr style=\"background-color: white;\"><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">4</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">JUPITER Booster</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">GH200</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">Jülich</td></tr><tr style=\"background-color: #f9f9f9;\"><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">11</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">Isambard-AI phase 2</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">GH200</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">Bristol</td></tr><tr style=\"background-color: white;\"><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">13</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">ISEG2</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">H200 SXM5</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">Nebius</td></tr><tr style=\"background-color: #f9f9f9;\"><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">15</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">ABCI 3.0</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">H200 SXM5</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">AIST</td></tr><tr 
style=\"background-color: white;\"><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">17</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">Discovery 6</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">GH200</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">ExxonMobil</td></tr><tr style=\"background-color: #f9f9f9;\"><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">18</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">SSC-24</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">H100 SXM5</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">Samsung</td></tr><tr style=\"background-color: white;\"><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">26</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">Njoerd</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">H100 SXM5</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">Northern Data Group</td></tr><tr style=\"background-color: #f9f9f9;\"><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">27</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">ABCI-Q</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">H100 SXM5</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">AIST</td></tr><tr style=\"background-color: white;\"><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">33</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">AI-03</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">MI210</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">Core42</td></tr><tr style=\"background-color: #f9f9f9;\"><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">36</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">FPT AI Factory Japan</td><td style=\"border: 1px 
solid rgb(204, 204, 204); padding: 8px;\">H200 SXM5</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">FPT</td></tr></tbody></table></div><p>Aside from Core42's weird MI210 cluster, every new big system was either GH200 (for traditional HPC) or H100/H200 SXM5 (for AI). This suggests a few interesting things:</p><ul><li>None of the AI cloud/GPUaaS providers are talking about GH200. It seems that GH200 is squarely for scientific computing, and Hopper HGX systems are preferred for AI at scale.</li><li>Despite debuting on Top500 two years ago, H100 is still making its way into the hands of HPC and AI sites. This could mean one of several things:<ul><li>H100 is more affordable now (<a href=\"https://www.nextplatform.com/2025/05/08/supermicro-hiccups-on-hopper-pulls-40-billion-guidance-for-fiscal-2026/#:~:text=“But%20I%20said,say%20that.”%20%5Blaughter%5D\">Jensen says he can't give them away</a>),</li><li>there was a huge backlog of H100 orders, or</li><li>it's just taking some places a really long time to get H100 up and running</li></ul></li><li>Blackwell is not relevant to HPC right now. There are no big Blackwell systems on this list, nor was Blackwell discussed in any sessions I attended during the week. This is despite large GB200 systems being public, up, and benchmarked. For example, <a href=\"https://github.com/mlcommons/training_results_v5.0/blob/main/IBM%2BCoreWeave%2BNVIDIA/systems/carina_ngpu2496_ngc25.04_nemo.json\">CoreWeave, IBM, and NVIDIA ran MLPerf Training across 39 racks (624 nodes) of a GB200 NVL72 system named Carina just last month</a>. They did not appear to bother with HPL, though.</li></ul><p>From all this, it seems like there is a definite lag forming between what qualifies as \"leadership computing\" to HPC people and AI people. Today's leadership HPC (Hopper GPUs) is yesterday's leadership AI, and today's leadership AI (Blackwell GPUs) isn't on the radar of leadership HPC yet. 
Maybe GB200 will begin appearing one or two years later as the AI people move on to Vera Rubin.</p><p>So, if I had to guess, I think the top-end of Top500 in 2027 could look like one of three things:</p><ol type=\"1\"><li>It will contain HPC systems with state-of-the-art, HPC-specific variants of accelerators that are completely irrelevant to AI. Large AI training systems will simply disappear from the list, because HPL has ceased to be a meaningful measure of their capability. GB200/GB300 simply never appear on Top500.</li><li>It will contain HPC systems with previous-generation Blackwell accelerators after Jensen (the chief revenue destroyer) gets on stage and tells the world that Blackwell is junk because Rubin is awesome. The AI industry gobbles up all the Rubin GPUs, and HPC picks up the scraps they leave behind.</li><li>Top500 starts allowing FP64 emulation, and all bets are off on how ridiculous the top systems' numbers look. In this case, top systems just skip the 1-10 exaflops range and start debuting at tens of exaflops.</li></ol><p>I have no idea where things will go, but we're starting to see <a href=\"https://www.nersc.gov/what-we-do/computing-for-science/doudna-system\">big HPC deals</a> <a href=\"https://blogs.nvidia.com/blog/blue-lion-vera-rubin/\">targeting Vera Rubin</a> that line up with the same time Rubin will land for the AI industry in 2H2026. So maybe Blackwell is just a hiccup, and option #1 is the most likely outcome.</p><h2 id=\"hpc-around-the-world\">HPC around the world</h2><p>Though Blackwell's absence from Top500 was easy to overlook, China's continued absence was much more obvious. Even though no new Chinese systems have been listed in a few years now, representatives from several Chinese supercomputing centers still contributed invited talks throughout the week.</p><p>In that context, I appreciated how fully ISC embraces its international scope. 
I found myself attending a lot of \"HPC Around the World\" track sessions this year, partly because I work for a multinational corporation and have to stay aware of potential needs outside of the usual US landscape. But there's also been a sharp rise in the amount of serious HPC that is now occurring outside of the USA under the banner of \"sovereign AI,\" and I've been keen to understand how \"sovereign AI\" compares to the US-based AI infrastructure in which I work.</p><p>Before getting too deep into that though, China is worth discussing on its own since they had such a prominent presence in the ISC program this year.</p><h3 id=\"hpc-in-china\">HPC in China</h3><p>Following the single-track opening keynote on the first day of ISC is the single-track Jack Dongarra Early Career Award Lecture, and this year's talk was given by awardee Prof. Lin Gan from Tsinghua University. In addition, Dr. Yutong Lu gave two separate talks--including the closing keynote--which shed light on the similarities and differences between how China and the US/Europe are tackling the challenges of exascale and beyond.</p><p>China is in a position where it does not have access to US-made GPUs, forcing them to develop their own home-grown processors and accelerators to meet their needs for leadership computing. As a result, both speakers gave talks that (refreshingly) revolved around non-GPU technologies as the basis for exascale supercomputers. Although neither Gan nor Lu revealed anything that wasn't already written about in the Gordon Bell prize papers, I took away a few noteworthy observations:</p><p><strong>The most public Chinese exascale system is always called the \"New Sunway\" or \"Next Generation Sunway,\" never \"OceanLight\"</strong> as has been reported in western media. There still aren't any photos of the machine either, and Dr. Gan used stock diagrams of the predecessor Sunway TaihuLight to represent New Sunway. 
There was no mention of the Tianhe Xingyi/TH-3 supercomputer at all.</p><p><strong>Chinese leadership computing details remain deliberately obfuscated despite the openness to present at ISC.</strong> For example, Lu presented the following English-language table from the <a href=\"https://www.csiam.org.cn/1003/202411/2246.html\">2024 China Top100 HPC list</a>:</p><div><table style=\"border-collapse: collapse; font-family: sans-serif; font-size: 0.75em; white-space: nowrap;\"><thead style=\"background-color: #f2f2f2;\"><tr><th style=\"border: 1px solid rgb(204, 204, 204); padding: 8px; text-align: left;\">No.</th><th style=\"border: 1px solid rgb(204, 204, 204); padding: 8px; text-align: left;\">Vendor</th><th style=\"border: 1px solid rgb(204, 204, 204); padding: 8px; text-align: left;\">System</th><th style=\"border: 1px solid rgb(204, 204, 204); padding: 8px; text-align: left;\">Site</th><th style=\"border: 1px solid rgb(204, 204, 204); padding: 8px; text-align: left;\">Year</th><th style=\"border: 1px solid rgb(204, 204, 204); padding: 8px; text-align: left;\">Application</th><th style=\"border: 1px solid rgb(204, 204, 204); padding: 8px; text-align: right;\">CPU Cores</th><th style=\"border: 1px solid rgb(204, 204, 204); padding: 8px; text-align: right;\">Linpack (Tflops)</th><th style=\"border: 1px solid rgb(204, 204, 204); padding: 8px; text-align: right;\">Peak (Tflops)</th><th style=\"border: 1px solid rgb(204, 204, 204); padding: 8px; text-align: right;\">Efficiency (%)</th></tr></thead><tbody><tr style=\"background-color: white;\"><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">1</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">Server Provider</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">Supercomputing system mainframe system, heterogeneous many-core processor</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">Supercomputing Center</td><td style=\"border: 1px solid rgb(204, 
204, 204); padding: 8px;\">2023</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">computing service</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px; text-align: right;\">15,974,400</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px; text-align: right;\">487,540</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px; text-align: right;\">620,000</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px; text-align: right;\">78.7</td></tr><tr style=\"background-color: #f9f9f9;\"><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">2</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">Server Provider</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">Internet Company Mainframe System, CPU+GPU heterogeneous many-core processor</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">Internet company</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">2022</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">computing service</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px; text-align: right;\">460,000</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px; text-align: right;\">208,260</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px; text-align: right;\">390,000</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px; text-align: right;\">53.4</td></tr><tr style=\"background-color: white;\"><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">3</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">Server Provider</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">Internet Company Mainframe System, CPU+GPU heterogeneous many-core processor</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">Internet company</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 
8px;\">2021</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">computing service</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px; text-align: right;\">285,000</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px; text-align: right;\">125,040</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px; text-align: right;\">240,000</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px; text-align: right;\">52.1</td></tr><tr style=\"background-color: #f9f9f9;\"><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">4</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">NRCPC</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">Sunway TaihuLight, 40960*Sunway SW26010 260C 1.45GHz, customized interconnection</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">NSCC-WX</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">2016</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">supercomputing center</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px; text-align: right;\">10,649,600</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px; text-align: right;\">93,015</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px; text-align: right;\">125,436</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px; text-align: right;\">74.2</td></tr><tr style=\"background-color: white;\"><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">5</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">Server Provider</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">Internet Company Mainframe System, CPU+GPU heterogeneous many-core processor</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">Internet company</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">2021</td><td style=\"border: 
1px solid rgb(204, 204, 204); padding: 8px;\">computing service</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px; text-align: right;\">190,000</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px; text-align: right;\">87,040</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px; text-align: right;\">160,000</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px; text-align: right;\">51.2</td></tr><tr style=\"background-color: #f9f9f9;\"><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">6</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">NUDT</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">Tianhe-2A, TH-IVB-MTX Cluster + 35584*Intel Xeon E5-2692v2 12C 2.2GHz + 35584 Matrix-2000, TH Express-2</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">NSCC-GZ</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">2017</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">supercomputing center</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px; text-align: right;\">427,008</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px; text-align: right;\">61,445</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px; text-align: right;\">100,679</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px; text-align: right;\">61.0</td></tr><tr style=\"background-color: white;\"><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">7</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">Server Provider</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">Internet Company Mainframe System, CPU+GPU heterogeneous many-core processor</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">Internet company</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">2021</td><td style=\"border: 1px solid rgb(204, 
204, 204); padding: 8px;\">computing service</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px; text-align: right;\">120,000</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px; text-align: right;\">55,880</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px; text-align: right;\">110,000</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px; text-align: right;\">50.8</td></tr><tr style=\"background-color: #f9f9f9;\"><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">8</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">Server Provider</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">ShenweiJing Supercomputer System, 1024*SW26010Pro heterogeneous many-core processor 390C MPE 2.1 GHz</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">Computing Company</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">2022</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">scientific computing</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px; text-align: right;\">399,360</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px; text-align: right;\">12,912</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px; text-align: right;\">14,362</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px; text-align: right;\">89.9</td></tr><tr style=\"background-color: white;\"><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">9</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">Server Provider</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">Supercomputing Center System, 992*SW26010Pro heterogeneous many-core processor 390C MPE 2.1 GHz</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">Supercomputing Center</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">2021</td><td style=\"border: 
1px solid rgb(204, 204, 204); padding: 8px;\">scientific computing</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px; text-align: right;\">386,880</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px; text-align: right;\">12,569</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px; text-align: right;\">13,913.0</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px; text-align: right;\">90.3</td></tr><tr style=\"background-color: #f9f9f9;\"><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">10</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">BSCCC/Intel</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">BSCCC T6 Section 5360*Intel Xeon Platinum 9242 homogeneous many-core processor 48C 2.3 GHz, EDR</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">BSCCC</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">2021</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px;\">computing service</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px; text-align: right;\">257,280</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px; text-align: right;\">10,837</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px; text-align: right;\">18,935.0</td><td style=\"border: 1px solid rgb(204, 204, 204); padding: 8px; text-align: right;\">57.2</td></tr></tbody></table></div><p>The #1 system is almost definitely built on SW26010P processors just like the big New Sunway system that Gan discussed (15,974,400 cores / 390 cores per SW26010P = 40,960 nodes), but it's significantly smaller than the 39M cores on which the work Gan highlighted was run. 
Clearly, China's biggest systems aren't on their own Top100 list, and their #1 listed system only says its processors are \"heterogeneous many-core\" despite smaller entries explicitly listing SW26010P (Pro) processors.</p><p><strong>Chinese leadership computing struggles aren't being hidden</strong>. Lu specifically called out a \"lack of a new system\" in 2024, echoing earlier sentiments from other leaders in Chinese HPC who have referred to <a href=\"https://news.sciencenet.cn/htmlnews/2024/11/534141.shtm\">\"some difficulties in recent years\" and a \"cold winter\" of HPC</a>. She also said that their leadership systems are \"relatively\" stable rather than trying to overstate the greatness of Chinese HPC technology. But as with above, she didn't get into specifics; by comparison, Scott Atchley (of Oak Ridge Leadership Computing Facility) specifically quoted a 10-12 hour mean time between job interrupt on Frontier after his keynote. Whether 10-12 hours is \"relatively stable\" remained unspoken.</p><p><strong>Performance portability wasn't a top-line concern despite how hard it seems to port applications to Chinese accelerators.</strong> SW26010P is weird in that it has a host core and offload cores with scratchpads, and its native programming model (Athread) is very CUDA-like as a result. Gan made it seem that China is investing a lot of effort into \"fine-grained optimizations\" using OpenACC and Athread, and he showed all the ways in which they're rewriting a lot of the kernels and decompositions in complex applications (like <a href=\"https://www.cesm.ucar.edu/models/cam\">CAM</a>) to make this work. 
This sounds like a performance portability nightmare, yet there wasn't much talk about Chinese equivalents to performance portability frameworks like Kokkos, RAJA, or alpaka.</p><p>Lu did name-drop a few frameworks that unify HPC and AI performance portability from around the world:</p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure><figcaption class=\"image-caption\">Yutong Lu's only reference to software that enhances portability and productivity. Not quite the same as what Kokkos, RAJA, and alpaka aim to solve, though.</figcaption></figure></div><p>However, these were more about aligning efforts across scientific computing and AI rather than enabling scientific apps to run seamlessly across China's different exascale accelerators.</p><p><strong>Application focus areas in China seem similar to everywhere else.</strong> Classical and quantum materials modeling, climate and ocean modeling, electronic structure calculations, and genomics were all mentioned by Gan and Lu in their talks. There was no mention of stockpile stewardship or any defense-related applications of HPC, though I'm sure China is using big supercomputers in these efforts just as US and European nations do. The only unusual application that I noticed was Gan's mention of implementing reverse time migration (RTM) on FPGAs; I've only ever heard of RTM in the context of oil exploration. Though I'm no expert, I didn't think many HPC centers spent a lot of time focusing on that technique. I do know KAUST has done some work optimizing RTM applications with Aramco in the space, but most other national supercomputing centers keep oil and gas at arm's length. 
Gan's RTM work may be related to earthquake modeling rather than petroleum, but it stood out nonetheless.</p><p><strong>Nobody talked about GPUs.</strong> Gan spent a healthy amount of time talking about applying FPGAs and NPUs to scientific problems, but these are areas of research that are on the fringes of mainstream HPC. I'm not sure if this reflected his own interests or priority research directions in China, but given that Chinese researchers cannot procure NVIDIA or AMD GPUs, perhaps FPGAs and NPUs are being pursued as a potential next-best-thing. Necessity truly is the mother of invention, and China might be the driver of a disproportionate amount of innovation around dataflow processing and reduced precision for modeling and simulation workloads.</p><p><strong>Nobody talked about storage either.</strong> I'm not sure if this suggests China has a lopsided interest in compute over holistic system design, or if they just talked about their biggest challenges (which are using home-grown accelerators productively). Granted, keynote speakers rarely talk about storage, but I didn't see much participation from China in any of the subsystem-specific sessions I attended either. This is particularly notable since, for a time, Chinese research labs were dominating the IO500 list with their home-made file systems. Networking was mentioned in passing in Lu's closing keynote, but not much beyond another example of technology fragmentation, and there were no specific Chinese interconnects being discussed during the week.</p><p><strong>China is in the thick of AI just like the rest of the world.</strong> Lu said that 30% of the cycles on their big HPC systems go to AI, which is right in line with anecdotes from other HPC sites that put their figures at <a href=\"https://csc.fi/en/media-release/lumis-capacity-in-high-demand-to-be-succeeded-by-an-ai-optimized-supercomputer/?utm_source=chatgpt.com\">somewhere up to 50%</a>. 
She also presented the Chinese taxonomy of the three ways in which AI and scientific computing can mesh together: HPC for AI (training LLMs on supercomputers), HPC by AI (AI for system design and operations), and HPC and AI (AI in the loop with simulation). China is also neck-deep in figuring out how to exploit reduced precision (or \"intelligent computing,\" as Lu branded it) and has pivoted from being \"performance driven\" (which I took to mean HPL-driven) to \"target driven\" (which I took to mean scientific outcome-driven). This is consistent with their recent Gordon Bell prize win and non-participation in either Top500 or China Top100.</p><p><strong>China is embracing geo-distributed supercomputing and complex workflows</strong>, much like the US. Lu specifically called out \"Computility Net,\" a catchy name that sounded a lot like the US DOE's Integrated Research Infrastructure (IRI). She described it as a national effort to combine supercomputing with \"commodity IT\" resources (perhaps Chinese cloud?) to enable \"resource sharing\" through a \"service grid.\" In her closing keynote, she even name-dropped IRI:</p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure><figcaption class=\"image-caption\">The Chinese vision for Computility Net, which seems analogous to the US Integrated Research Infrastructure, as presented by Yutong Lu.</figcaption></figure></div><p>She did liken Computility to both IRI in the US and PRACE in the EU though, and in my mind, PRACE is nothing like IRI. Rather, PRACE is more like TeraGrid/XSEDE/ACCESS in that it federates access to HPC systems across different institutions, whereas IRI's ambition is to tightly integrate computational and experimental facilities around the country. But from the above slide, it sounds like Computility Net is closer to IRI since it is coupled to \"Supercomputing internet\" (akin to ESnet?) 
and bridging compute and data across eastern and western China.</p><h3 id=\"elsewhere-in-asia\">Elsewhere in Asia</h3><p>Although Chinese researchers headlined a few sessions at ISC, a number of other Asian nations presented their national supercomputing strategies as well. Japan and Korea have mature, world-class HPC programs, but I was surprised to see how ambitious India has become to catch up. Smaller nations were also represented, but it was clear to me that their focus is spread across midrange HPC, partnering with large centers in Korea/Japan, and innovating around the edges of supercomputing. And perhaps unsurprisingly, every nation represented had a story around both quantum computing and artificial intelligence regardless of how modest their production modsim infrastructure was.</p><p><strong>India</strong> appears to be rapidly catching up to the US, Europe, and Japan much in the same way China was fifteen years ago. Representatives from C-DAC, the R&amp;D organization that owns the national supercomputing mission in India, gave a far-reaching presentation about India's ambition to achieve exascale by 2030. Their current strategy appears to be broad and capacity-oriented, with forty petascale clusters spread across India for academic, industrial, and domain-specific research. They have a comprehensive, if generic, strategy that involves international collaboration in some regards, reliance on open-source software to fill out their HPC environment story, and home-grown hardware and infrastructure:</p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure><figcaption class=\"image-caption\">India's ambitious strategy towards exascale in 2030. This slide has it all, from home-grown CPUs and networks to five systems deployed in six years.</figcaption></figure></div><p>I was surprised to hear about their ambitions to deploy their own CPUs and interconnect though. 
India is pursuing both ARM and RISC-V for their own CPUs for a future 200 PF system, and they're already deploying their \"InfiniBand-like\" interconnect, TRINETRA, which uses funny NICs with <a href=\"https://cdac.in/index.aspx?id=product_details&amp;productId=TrinetraHPCInterconnect\">6x100G ports or 10x200G ports</a> rather than fewer, faster serdes. I didn't hear mention of their AI acceleration plans, but rolling their own commercialized CPU and interconnect in itself is a lot to bite off. Given that India is the world's fastest growing economy though, these plans to go from 20 PF in 2025 to 1 EF in 2030 may not be that far-fetched. Perhaps the Indian national strategy will become clearer during the inaugural <a href=\"https://sc-india.in\">Supercomputing India 2025 conference</a> this December.</p><p>The <strong>Korea Institute of Science and Technology Information</strong> also took the stage to describe their next national supercomputer, <a href=\"https://www.glennklockwood.com/garden/systems/KISTI-6\">KISTI-6</a>, which was first announced in May 2025. It will be a 588 PF Cray EX254n system with 2,084 nodes of GH200, similar to <a href=\"https://www.glennklockwood.com/garden/systems/Alps\">Alps</a> and <a href=\"https://www.glennklockwood.com/garden/systems/Isambard-AI\">Isambard-AI</a>. This is quite a step up from its predecessor, which was an air-cooled KNL system, but it's unlikely it will unseat Fugaku; the 588 PF number cited appears to be the sum of 2,084 GH200 nodes, 800 Turin CPU nodes, and 20 H200 SXM5 nodes. 
The HPL score of its GH200 nodes will place it below <a href=\"https://www.glennklockwood.com/garden/systems/Alps\">Alps</a> and somewhere around 350 PF, likely joining a flood of multi-hundred-petaflops GH200 systems that will appear between now and ISC26.</p><p><strong>Singapore (NSCC) and Taiwan (NCHC)</strong> both presented their national programs as well, but they appear to be much more nascent, and the size of their HPC infrastructure was presented as aggregate capacity, not capability. Their strategies involve partnership with Japan or Korea, but both had specific carveouts for both sovereign AI and quantum computing. Interestingly, their use cases for AI both had a strong story about training models that understood the diversity of languages and dialects represented in their nations. For example, it is not unusual for people to switch languages or dialects mid-sentence in Singapore, and the big Western models aren't designed for that reality. Similarly, Taiwan has 16 indigenous tribes with 42 dialects. It seemed like enabling LLMs that reflect the breadth of languages used in Singapore and Taiwan has become the responsibility of these nations' respective national supercomputing efforts.</p><p>That said, that noble mission didn't seem to be matched with substantial training infrastructure; these localized models will be relying on a couple hundred GPUs here and there, wedged into existing HPC centers. Thus, these sovereign models are probably going to be fine-tuned variants of open models, aligning with my earlier observation that these smaller nations will be innovating around the edges of HPC and AI.</p><p><strong>What was missing?</strong> Although Vietnam, Thailand, Malaysia, and other Asian nations have strong HPC programs centered around industrial uses, they were not represented in ISC's HPC Around the World track. 
Also absent was any meaningful discussion around cloud; while everyone had a throwaway line about cloud in their presentations, the fact that the only big clouds in Asia are Chinese and American probably makes it unappealing to integrate them into the core of these nations' national HPC strategies. Speaking from experience, this is quite different from the attitudes of commercial HPC users across Asia who are all too happy to let someone else run HPC datacenters for them.</p><h3 id=\"the-middle-east\">The Middle East</h3><p>Although KAUST has been a world-class HPC center in the Middle East for the past fifteen years, AI seems to be where the majority of new investment into HPC is going.</p><p>In describing new efforts in Saudi Arabia, Prof. David Keyes casually mentioned the Saudi HUMAIN effort, which will build 500 MW of datacenter capacity and 18,000 GB300 GPUs, after describing the Shaheen-3 GH200 upgrade that \"might (barely)\" put it back in the Top20 by SC'25. Similarly, Dr. Horst Simon walked through a few of Abu Dhabi's university clusters (each having dozens of GPU nodes) after skating through an announcement that a 5 GW AI campus was also being built in Abu Dhabi. The gap between investment in AI and investment in HPC was striking.</p><p>I also had a brief conversation with someone from one of the major Abu Dhabi universities, and I was very surprised to find that I was talking to a real AI practitioner--not an HPC person moonlighting in AI--who spoke at the same depth as the customers with whom I work in my day job. The nature of his work made it clear to me that, despite his university not having a Top500 system, he was familiar with running training and inference at scales and with sophistication that is far beyond the experience of most ISC attendees.</p><p>These interactions led me to the conclusion that the Middle East's approach to \"sovereign AI\" is quite different from Europe's. 
Rather than building HPC systems with GPUs, letting HPC centers operate them, and calling them sovereign AI platforms, nations like Saudi Arabia and UAE are keeping HPC and AI separate. Like in the US, they are going straight to hyperscale with AI, and they have no preconceived notion that anything resembling a supercomputer must be hosted at a supercomputer center.</p><p>Of course, only nations like Saudi Arabia and UAE can afford to do this, because they have trillion-dollar sovereign wealth funds to invest in massive infrastructure buildout that isn't contingent on public consensus or the latest election cycle. Just as UAE's Core42 can build a 5 GW datacenter campus with little oversight, these nations can easily mis-step and invest a ton of money in an AI technology that turns out to be a flop. In the end, it seems like these Middle Eastern nations are willing to take bigger risks in how they build out their sovereign AI infrastructure, because they are largely starting from a blank sheet of paper. They aren't limiting themselves to 20 MW supercomputers like the HPC world had.</p><p>All things being equal, this might turn out to be an advantage over other nations who are more hesitant to deviate from the tried-and-true course of buying a Cray or a Bull, sticking some GPUs in it, and calling it AI. If these Middle Eastern nations do everything right, they stand to get a lot further and move a lot faster in sovereign AI than Europe, and it'll be fascinating to see how quickly they catch up with the sort of frontier AI research being done in private industry. 
But, as with the US AI industry, it doesn't seem like these AI practitioners are going to be attending ISC in the same way European sovereign AI folks do; the roads of HPC and AI seem to run parallel without intersecting in the Middle East.</p><h2 id=\"exhibitors\">Exhibitors</h2><p>ISC had a <a href=\"https://isc-hpc.com/the-isc-2025-exhibition-sets-new-records/\">record number of exhibitors this year</a>, and as usual, I tried to set aside at least an hour or two to walk the floor and see what technologies are on the horizon. This year, though, the exhibit hall was not a great representation of the rest of the conference. Everyone I talked to about the exhibit said one of two things:</p><ol type=\"1\"><li>There are a LOT of quantum companies.</li><li>A lot of big companies were noticeably absent.</li></ol><p>It also didn't feel like the biggest exhibit ever, partially because of #2, and partially because many of the exhibitors--one in five--were exhibiting for the first time this year. This meant a lot of the booths were small and barebones, and many of them belonged to either companies at the periphery of HPC (such as companies that make dripless couplers for liquid cooling) or small startups who just had a desk, a few pens, and some brochures.</p><p>On the first point, it was true--quantum computing was well represented, with 22% of exhibitors identifying as being involved in the field in some form. In fact, quantum felt over-represented, since the ISC technical program certainly didn't have such a large fraction of talks on quantum computing topics. 
I didn't have time to actually talk with any of these quantum companies though, so I wasn't able to get a sense of why the startup ecosystem around quantum computing was so rich in Europe as compared to the US.</p><p>While there was an abundance of quantum this year, a number of the big HPC and HPC-adjacent companies were noticeably absent:</p><ul><li>Amazon, Azure, and Google did not have booths despite having booths last year. Amazon and Google still sponsored the conference at the lowest tier (bronze) though, while Microsoft did not sponsor at all.</li><li>Intel had neither booth nor sponsorship despite having the #3 system on Top500. I don't think they held a party this year, either. AMD didn't have a booth, but they sponsored (and gave the opening keynote!).</li><li>WEKA neither had a booth nor sponsored the conference this year, although they were the leading sponsor of the Student Cluster Competition. Competitors DDN, VAST, Quobyte, and BeeGFS all had booths, but only VAST sponsored. Curiously, Pure and Scality, which do not have big footholds in leadership HPC, had both booths and sponsorship.</li></ul><p>These companies who chose not to have a booth still sent people to the conference and were conducting meetings as usual, though. This suggests that there's something amiss with how large companies perceive the return on investment of having a booth at ISC. I don't have any insider knowledge here, but I was surprised by the pullback since ISC has historically been very good at incentivizing attendees to walk through the expo hall by putting it between the technical sessions and the food breaks.</p><p>As I walked the exhibit floor, I found that prominent booths spanned the whole HPC stack: software, system integrators, component makers (CPUs, GPUs, HBM and DDR, and SSD and HDD), and datacenter infrastructure were all exhibiting. 
The most eye-catching booths were those with big iron on display: HPE/Cray had a full EX4000 cabinet and CDU on display, and there were a few Eviden BullSequana nodes floating around.</p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure><figcaption class=\"image-caption\">The Cray EX4000 cabinet (right) and its CDU (left) on display at the ISC'25 exhibition hall. One of the most eye-catching displays, even though they've been on display at ISC and SC for a few years now.</figcaption></figure></div><p>Sadly, though, there were no full BullSequana X3000 racks on display. I've still never seen one in real life.</p><p>Infrastructure companies like Motivair (who manufactures the CDUs for Cray EX) and Rittal (which I know as a company that manufactures racks) also had big liquid-liquid heat exchangers on display with shiny steel piping. Here's a smaller version of the Cray EX CDU that Motivair was displaying:</p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure><figcaption class=\"image-caption\">A close-up view of a smaller liquid-liquid heat exchanger CDU on display at the Motivair booth right next to HPE's. Strangely, the mechanics of these systems dovetail with what I've learned as a part of my other hobby outside of HPC, which is operating a multi-family residential high-rise.</figcaption></figure></div><p>I got to chatting with some good folks at Motivair, and I learned that the 1.2 MW variant that is used with Cray EX has a 4\" connection--the same size as the water main in <a href=\"https://glennklockwood.com/garden/LRA\">my coop</a>. Since I recently helped with the replacement of my building's water main, this led me down a rabbithole where I realized that the flow rate for this CDU is roughly the same as my apartment building's too, which is to say, a single Cray CDU moves as much fluid as a 55-unit apartment building. 
Incidentally, a single Cray EX cabinet supports roughly the same electrical capacity as my 55-unit building too--I am in the process of replacing our 1,200 A service panel, which comes out to about the same 400 kVA as fully loaded EX.</p><p>Aside from the Cray cabinets and CDUs, which are no longer new to ISC, I couldn't put my finger on any particularly outstanding booths this year though. The exhibit felt like a sea of smaller companies, none of which really grabbed me. This isn't to say that big vendors were wholly absent though. Despite not having booths, all three big cloud providers threw parties during the week: AWS and NVIDIA teamed up on a big party with over a thousand registrants, while Google and Microsoft held smaller parties towards the end of the week. HPE also threw a lovely event that was off the beaten path along the Elbe, resulting in a less-crowded affair that made it easy to catch up with old friends.</p><p>I may be reading too much into this year's exhibit, but it felt like ISC might be transforming into an event for smaller companies to gain visibility in the HPC market, while larger companies apply their pennies only in the parts of the conference with the highest return. Whether a company chose to have a booth, sponsor the conference, and/or throw a party seemed to defy a consistent pattern though, so perhaps other factors were at play this year.</p><h2 id=\"cloud-or-lack-thereof\">Cloud, or lack thereof</h2><p>Because I work for a large cloud service provider, I attended as many cloud HPC sessions as I could, and frankly, I was disappointed. The clear message I got by the end of the week was that Europe--or perhaps just ISC--doesn't really care about the cloud. 
This is quite different from the view in the US, where the emergence of <a href=\"https://www.glennklockwood.com/garden/systems/Eagle\">massive AI supercomputers</a> has begun to shift opinions to the point where <a href=\"https://www.theregister.com/2024/07/24/oak_ridge_discovery/\">the successor to the Frontier supercomputer at OLCF might wind up in the cloud</a>. I suppose cloud is a lot less attractive outside of the US, since all the major cloud providers are US corporations, but the way in which cloud topics were incorporated into the ISC program this year sometimes felt like a box-checking exercise.</p><p>For example, I attended the BOF on \"Towards a Strategy for Future Research Infrastructures\" which I expected to be a place where we discussed the best ways to integrate traditional HPC with stateful services and other workflow components. While cloud was mentioned by just about every panelist, it was almost always in a throwaway statement, lumped in with \"the edge\" or cited as a vague benefit to \"new workflows and interactive analysis\" with no further detail. One speaker even cited egress fees as a big challenge which, to me, means they haven't actually talked to a cloud provider in the last five to ten years. If egress fees are what stop you from using the cloud, you're talking to the wrong account team.</p><p>I get it though; there are times where cloud often doesn't offer enough obvious benefit for HPC to justify the effort required to figure it out. In those cases, it's incumbent on cloud providers to provide a better story. But I was also disappointed by the invited session called \"Bridging the Gap: HPC in the Cloud and Cloud Technologies in HPC,\" which I hoped would be the place where cloud providers could make this case. Instead, only two of the three CSPs were even invited to speak, and it was clear that the speakers did not all get the same assignment with their invitations. 
Granted, the CSP for whom I work was the one not invited (so I came in a little biased), but I was surprised by how differently each speaker used their time.</p><p>Dr. Maxime Martinasso from CSCS gave a talk from the perspective of trying to add cloud-like capabilities to a supercomputer, which is a recurring pattern across a number of sites (including many in the US DOE) and projects. He explained the way they're creating an infrastructure-as-code domain-specific language that sits on top of <a href=\"https://www.glennklockwood.com/garden/systems/Alps\">Alps</a>, their Cray EX system, to give users the ability to bring their own software stacks (all the way down through Slurm) to the supercomputer. It was clearly a ton of work on CSCS's part to develop this capability, and yet the talk's \"future work\" slide contained a bunch of features which those of us in the cloud would consider \"P0\"--priority zero, or essential for a minimum viable product.</p><p>By the end of Martinasso's talk, I realized that CSCS's perspective is that, unlike commercial cloud, these cloudy features aren't P0; having a supercomputer on the floor is. He made the case that CSCS has a need to explore diverse computing architectures and accelerators (as evidenced by the five different node types in Alps!), and putting them all on a single RDMA fabric isn't something any cloud provider will do. As a result, adding any new cloud-like capability to the heterogeneous supercomputer is just gravy, and the fact that true cloud is more \"cloudy\" than Alps is irrelevant since the cloud will never support the intra-fabric heterogeneity that Alps does.</p><p>The other two speakers represented big cloud providers, and their talks had a bit more product pitch in them. 
One speaker talked through the challenges the cloud is facing in trying to fold supercomputing principles into existing cloud infrastructure (a theme I repeated in my talk later in the week) before talking about specific products that have arisen from that. It touched on some interesting technologies that the HPC world hasn't yet adopted (like optical circuit switching--super cool stuff for programmable fabrics), and I learned a few things about how that provider might bring new HPC capabilities to the table for specific workloads.</p><p>The other speaker, though, presented a textbook pitch deck. I've given almost the same exact presentation, down to showing the same sort of customer stories and product comparison tables, during customer briefings. Execs in the audience would eat it up while engineers' eyes would glaze over, and having to do that song and dance is partly why I didn't make it as a product manager. <a href=\"https://bsky.app/profile/glennklockwood.com/post/3lrd3usnt222d\">I was incredulous</a> that such a presentation was an invited talk at one of the most prestigious HPC conferences in the world.</p><p>This is not to say I was mad at the speaker. He did exactly what one would expect from a leader in the sales side of an organization, hitting all the notes you'd want in a textbook pitch aimed at the C-suite. Rather, I was disappointed by the choice by the session organizers; when you invite someone whose job is driving business at one of the largest cloud providers to speak, you should fully expect a broad and salesy presentation. I don't think it's a stretch to say that most ISC attendees aren't looking for these sorts of high-level talks designed for enterprise decision-makers; they want insight and technical depth.</p><p>Was I miffed that a competitor got to give a twenty-minute sales pitch during a session at which I wasn't invited to speak? Absolutely. 
And do I think I could've given a talk in which even the most ardent cloud-hater would find something interesting? Probably. But since that didn't happen, the best I can do is complain about it on the Internet and hope that next year's program committee puts more care into organizing an invited speaker session on cloud and HPC.</p><p>Thankfully, I was given the opportunity to talk a little about my work at the <a href=\"https://sites.google.com/view/supercompcloud/isc25-9th-supercompcloud-workshop#h.fur7sdv6h19a\">SuperCompCloud workshop</a> on Friday. That workshop felt like what the \"Bridging the Gap\" invited session should've been, and there were roughly equal parts of presentations on adding cloud-like features to their HPC infrastructure and adding HPC-like features to cloud infrastructure. From my perspective, the workshop was great; I got to see how traditional HPC centers are adopting cloud practices into their operations, and I could explain how we overcame some of the challenges they're facing in Azure. But to my point at the outset of this section--that Europe doesn't really care about the cloud--the majority of speakers at SuperCompCloud were American.</p><h2 id=\"parting-thoughts\">Parting thoughts</h2><p>As I said at the outset, there were way more sessions that I missed than I attended. In addition, a lot of the big headlines of the week were coincident with, not made at, the conference. A few noteworthy announcements during the week that I won't go into detail about include:</p><ol type=\"1\"><li><a href=\"https://www.ed.ac.uk/news/university-set-to-host-ps750m-national-supercomputer\">£750M was awarded to EPCC</a> to deploy what sounds like the UK's first exascale system. 
This announcement's overlap with ISC was a total coincidence, so EPCC didn't have many details to share.</li><li><a href=\"https://ultraethernet.org/ultra-ethernet-consortium-uec-launches-specification-1-0-transforming-ethernet-for-ai-and-hpc-at-scale/\">The Ultra Ethernet Consortium announced the long-awaited version 1 of its spec</a>. I'm not sure how relevant this is to HPC yet, but given how many networking talks compared themselves against InfiniBand, I think there's a lot of appetite for a high-performance, non-proprietary alternative.</li><li>Sadly, <a href=\"https://www.hpcwire.com/2025/06/11/farwell-hpc-guru/\">HPC_Guru announced his retirement</a> mid-week as well. It's not clear this was deliberately timed with ISC, but it was acknowledged on the big stage during the ISC closing statements and resulted in a lot of <a href=\"https://bsky.app/profile/hpcguru.bsky.social/post/3lrcsdbwa522c\">recognition</a> <a href=\"https://x.com/hpc_guru/status/1932688759310725425?s=61\">online</a>. I credit HPC_Guru, whoever he is, with a lot of the success I've enjoyed in my career, as he amplified my voice as far back as 2009 when I first started on Twitter. Maybe with his retirement, I should try to do for others what he did for me.</li></ol><p>And along the lines of reflecting back over the years, this was ISC's 40th anniversary, and the organizers had a few wonderful features to commemorate the milestone. Addison Snell organized a panel where a variety of attendees got to discuss the impact that the conference has had on them over the past 40 years, and I was delighted to find that I was not the only person to <a href=\"https://glennklockwood.com/garden/ISC-conference#isc-40th-anniversary-panel\">reflect back on how ISC has shaped my career</a>. 
As critical as I can be of specific speakers and sessions when I write up these notes, I do hope it goes without saying that I wouldn't bother doing all this for a conference that wasn't deeply engaging and rewarding to be a part of.</p><p>Going back to this year's theme of connecting the dots, I think it's apt. Some ways in which HPC connected dots at ISC this year were obvious; the conference brought together people with a common interest in high-performance computing from across 54 countries and seven continents this year. But this year's conference also made it clear that the role of HPC going forward may be connecting the dots between different technologies being developed for AI, cloud, enterprise, and other markets and the problems in scientific computing that need to be solved.</p><p>The latest and greatest Blackwell GPUs barely registered at ISC this year, and the HPC community seems OK with that now. Instead of the focus being on the absolute top-end in high-performance accelerators, HPC's focus was on connecting the dots between last generation's GPUs and today's grand challenges in science. Instead of showcasing the newest innovations in secure computing in the cloud, HPC's focus was in connecting the dots between a few relevant pieces of zero trust and big-iron on-prem supercomputers.</p><p>HPC has always been about figuring out ways to use stuff invented for someone else to solve scientific challenges--connecting the dots. Beowulf clusters started that way, GPGPU computing started that way, and emulating DGEMMs (and other primitives) on AI accelerators will probably follow the same pattern. 
But different nations are drawing different lines between the dots; while the US might draw a shorter line between commercial cloud and HPC at scale, Europe is drawing shorter lines between HPC for scientific computing and HPC for sovereign AI.</p><p>If we accept that connecting the dots may be where the HPC community can make the most impact, then it's fitting that ISC chose to carry forward the theme of \"connecting the dots\" into ISC'26. This break from the tradition of introducing a new tagline each year suggests that, at times, optimizing what we already have can take us further than pursuing something completely new. After 40 years, ISC remains not only a showcase of innovation, but a reflection of how the HPC community (and its role in the technology landscape) is evolving. If we continue to embrace this theme of stitching together breakthroughs instead of spotlighting them individually, the HPC community is likely to be more relevant than ever alongside--not in spite of--the overwhelming momentum of hyperscale and AI.</p>",
            "url": "https://hpc.social/personal-blog/2025/isc-25-recap/",
            
            
            
            
            
            "date_published": "2025-06-24T05:58:00-06:00",
            "date_modified": "2025-06-24T05:58:00-06:00",
            
                "author": "Glenn K. Lockwood's Blog"
            
        },
    
        {
            "id": "https://hpc.social/personal-blog/2025/surfing-the-singularity-adventures-in-quantum-chemistry/",
            "title": "Surfing the Singularity- Adventures in Quantum Chemistry",
            "summary": null,
            "content_text": "&lt;div class=\"separator\" style=\"clear: both; text-align: center;\"&gt;&lt;/div&gt;&lt;div class=\"separator\" style=\"clear: both; text-align: center;\"&gt;&lt;/div&gt;In this installment of the Surfing the Singularity blog we go vlog, giving an overview of quantum computing today with application to chemistry. Quantum computing is rapidly advancing, with improvements in machine size, error correction, and scalability. And yet, there's always a desire to drive towards advancements and scientific applications which are just out of reach of today's technologies. New algorithms lead the way.&nbsp;In this video, will give a brief overview of quantum computing, what it means, where we are on the product roadmaps, and explore an emergent algorithm for pushing the boundaries of chemical modeling beyond what is possible with today's classical machines. Enjoy.&nbsp;- andy&nbsp;P.S. Begging your forgiveness for being a YouTube newb...&nbsp;&lt;p&gt;&lt;/p&gt;",
            "content_html": "<div class=\"separator\" style=\"clear: both; text-align: center;\"><br /></div><p><br />&lt;div class=\"separator\" style=\"clear: both; text-align: center;\"&gt;&lt;/div&gt;<br />&lt;div class=\"separator\" style=\"clear: both; text-align: center;\"&gt;<br />&lt;/div&gt;</p><p>In this installment of the Surfing the Singularity blog we go vlog, giving an overview of quantum computing today with application to chemistry. Quantum computing is rapidly advancing, with improvements in machine size, error correction, and scalability. And yet, there's always a desire to drive towards advancements and scientific applications which are just out of reach of today's technologies. New algorithms lead the way.&nbsp;</p><p>In this video, will give a brief overview of quantum computing, what it means, where we are on the product roadmaps, and explore an emergent algorithm for pushing the boundaries of chemical modeling beyond what is possible with today's classical machines. Enjoy.&nbsp;</p><p>- andy&nbsp;</p><p><br /></p><p>P.S. Begging your forgiveness for being a YouTube newb...&nbsp;</p><p></p><div class=\"separator\" style=\"clear: both; text-align: center;\"><br /></div><p><br />&lt;p&gt;&lt;/p&gt;</p>",
            "url": "https://hpc.social/personal-blog/2025/surfing-the-singularity-adventures-in-quantum-chemistry/",
            
            
            
            
            
            "date_published": "2025-03-11T13:11:00-06:00",
            "date_modified": "2025-03-11T13:11:00-06:00",
            
                "author": "Surfing the Singularity"
            
        },
    
        {
            "id": "https://hpc.social/personal-blog/2025/llm-training-without-a-parallel-file-system/",
            "title": "LLM training without a parallel file system",
            "summary": null,
            "content_text": "The illustrious Jeff Denworth recently posted a hot take across social media, claiming that training large language models (LLMs) doesn't require massive, expensive parallel file systems:As someone who's been working on one of the largest supercomputers on the planet--one that has no parallel file system at all--I was surprised by how many incredulous or curious responses followed. I guess supercomputers and parallel file systems are like peas and carrots in so many people's minds that the idea of being able to run a massive parallel compute job without a massive parallel file system is so unintuitive that it is unbelievable.I've given talks about how LLM training uses storage in the past, but I realized I've never written it down. So, for the benefit of humankind, let's talk about how these supercomputers without parallel file systems work.The workloadThough the actual model training on giant GPU supercomputers gets all the attention, the full process of training an LLM is a little more involved. A colleague of mine at Microsoft gave a great overview of this storage-centric, end-to-end picture at SNIA SDC24; broadly, training an LLM involves the following steps:Data ingestion: This is where crawlers scrape the Internet and pull down raw html, images, videos, and other media. These raw data are indexed and shoved into a data warehouse. At scale, this can be hundreds or thousands of petabytes of data for frontier models.Data preparation: This is where the raw data is converted into tokenized data. It amounts to a huge data analytics problem that uses well-documented text and image processing pipelines that filter, deduplicate, and otherwise clean the raw garbage on the Internet using frameworks like Apache Spark. The hundreds of petabytes of input get reduced down by 10x-1000x.Model training: This is where the tokenized data is shoveled through the LLM on giant GPU clusters in little batches. 
As the data is processed, the model weights are updated, and those weights are checkpointed to storage. If a compute node crashes and the job fails, that checkpoint is used to restart, just like a traditional scientific HPC application. There might be fine-tuning and the like happening as part of this too, but I won't talk about that.Model deployment and inferencing: This is where the final model is copied across giant fields of inferencing servers, and a web service sits in front of it all to transform REST API requests into actual inferencing queries that run on the GPUs. This isn't training, but we'll talk about it anyway.To understand why a parallel file system offers no particular benefit to any of these steps, let's take a closer look at what's going on in each one.Data ingestionData ingestion is a widely distributed process that involves minimal computation; you just need a lot of Internet-facing network connectivity and CPU cores to drive independent processes connecting to other people's public HTTP servers. I don't know a lot about what this process looks like, because it never relies on anything resembling a supercomputer.To the best of my knowledge, data ingestion just pulls HTML, images, or video streams from the Internet and packs them into data containers. As it is packing webpages into these files, it is building a separate index that stores metadata about the webpage (URL, encoding, date of access) and its location (the file in which the webpage's contents are stored and the byte offset within that file). 
Thousands of VMs might be performing these tasks completely independently, and because they do not need to synchronize with each other at any step, it can be better to distribute these scrapers around the world rather than centralize all of them in a single datacenter.While one could store each scraped HTML page in a file that's organized in a parallel file system, accessing those files would be very slow--a full crawl of all the data would require scanning hundreds of billions of little files. So instead of implementing data containers using files and the index using a file system directory tree, it's better to implement data containers on top of object stores and use a distributed key-value store for the index. The fact that scraped data is write-once (and therefore doesn't need features like file locking or read-modify-write), is a natural fit for object stores' design around object immutability.Data preparationOnce raw data is indexed and saved in object stores, the first phase of computation comes into play. I've documented this data processing pipeline on my LLM training datasets page, but a lot of it amounts to running Apache Spark-like pipelines that chew through all the raw data in a trivially parallel way.These data processing pipelines are very well defined from the days when Hadoop was all the rage, and their data access patterns map well to the strengths of object stores. Each processing task might read a couple hundred megabytes of data from an object all at once, process it in-memory, then dump it back out to objects all at once. File systems offer no benefit here, because each task reads once and writes once rather than skipping around inside individual objects.There is a significant compute workload here, and there are points in the data processing pipeline where global synchronization of all tasks is required. 
Specifically, the process of deduplicating input data--which is a critical step to getting a high-quality model these days--requires comparing every piece of data to every other piece of data. As a result, this data preparation phase is often done in a centralized location that is adjacent to the object store containing all the raw data scraped from the previous step. The clusters used for data processing can resemble traditional CPU-based supercomputers (think a system like TACC's Frontera), and in some cases, they might even have full RDMA fabrics to accelerate the all-to-all deduplication step.Critically, this data processing step is not done on the GPU nodes that actually train the model. Data processing is usually limited by I/O bandwidth to storage, and you never want your GPUs stalling out because they're waiting for data. Parallel file system vendors might tell you that the only way to avoid this GPU starvation issue is to plug every GPU node into a super-fast parallel file system, but the reality is that people just do this I/O-heavy step on completely separate supercomputers before training on GPUs ever begins.CPU nodes are significantly cheaper than GPUs, so buying cheap object storage and a cheap CPU cluster is more cost-effective than buying an expensive file system and wasting your GPU nodes on trivially parallel text processing tasks. 
To illustrate this, consider some normalized list prices from Azure:$1.00 gets you a 96-core general-purpose VM with 384 GB of RAM$1.65 gets you a 176-core HPC-optimized VM with NDR InfiniBand and 768 GB of RAM$22.55 gets you a 96-core, 8x H100 GPU VM with NDR InfiniBandGiven that GPUs don't give you a 13x-22x speedup for data processing despite costing 13x-22x the price, it makes no sense to perform this data processing on GPU nodes inline with training.One could argue that the GPUs are sitting idle while the data processing cluster is working anyway, but rest assured that AI model shops have no shortage of work to keep their GPUs busy. Data processing for the next model on a CPU cluster often happens at the same time the current model is being trained on the GPU cluster. In cases where there isn't enough work to keep both CPU and GPU clusters busy around the clock, also remember that most of this stuff happens in the cloud, and cloud providers can sell those idle CPU or GPU cycles to another customer in between training campaigns.Model trainingHuge, distributed training jobs are where most people would think a fast parallel file system is required for both reading input data and writing out checkpoints. After all, the need for fast checkpointing and restart was the primary driver behind the creation of parallel file systems.While parallel file systems certainly can be used for training, they are not the most cost-effective or scalable way to train across tens of thousands of GPUs. 
To better illustrate the reasons why this is, let's consider the processes of reading inputs and writing checkpoints separately.Reading training dataTraining a model on GPUs, whether it be on one or a thousand nodes, follows a simple cycle (this is a \"step\" in LLM training parlance) that's repeated over and over:A batch of tokenized data is loaded into GPU memoryThat data is then processed through the neural network and the model weights are adjustedAll GPUs synchronize their updated weightsIt's tempting to imagine the I/O load generated by step #1 as being the same as it would be for a traditional HPC job: data is read from a parallel file system into compute memory at the start of every single step:In years past, storage vendors would've insisted that this repeated, random re-reading of input data at every step requires a super-fast parallel file system to keep up. However, two factors make that untrue:The input data isn't millions of little text or image files. As described in the data ingest and data processing steps, these small files are packaged into large objects before the GPUs ever see them.Tokenized data is very dense compared to raw input, so the amount of bytes being read over the course of hundreds or thousands of steps is actually quite small.To quantify #2, consider the Llama-3 405b model, which was trained on a significant fraction of the public Internet--15.6 trillion tokens. That sounds like a lot of information until you realize that the size of a typical token is between 3 and 5 bytes depending on the tokenizer and encoding. This means that the entire 405-billion parameter Llama-3 model, which was trained using 16,000 GPUs, only had to load 60 TB of tokens from storage. 
That divides out to 3.75 GB of tokens processed by each GPU over the entire course of a 54-day run.When you consider how few bytes are required to train an LLM, it should become clear that the biggest I/O challenge in the performance-critical training loop isn't raw bandwidth; it's performance variability. As such, the best way to ensure that GPUs do not stall out due to read requests is to eliminate as much I/O performance variability as possible. To do this, you have to minimize the sources of contention that might arise between the storage devices and the network that connects them to the GPUs. While you can do this using sophisticated quality-of-service in both the storage servers and interconnect, there is an easier way.Just stick some local SSDs in every GPU node.This ensures that no contention will occur when loading data from storage into the GPU, because the only network between them is the PCIe on the node. In addition, using node-local NVMe allows storage capacity and storage performance to scale linearly with GPU performance. By comparison, a remote storage system (whether it be parallel file or object) won't get any bigger or faster as you add more GPUs to the training job, resulting in each GPU losing efficiency due to I/O as more GPUs are added to the training job.In practice, model training uses local SSDs like this:At the start of a training job, data is read from remote storage into the local SSDs in a distributed fashion once. Because the tokenized data is so small, many replicas of the entire dataset can be stored across the job's GPU nodes as well; for example, if you were to train Llama-3 405b on NVIDIA DGX H100 nodes, you could fit the entire training dataset (all 60 TB of it) on just three nodes since each node comes with 30 TB of local SSD. Given that the model was trained on 16,000 GPUs (2,000 nodes), that translates to storing hundreds of replicas of the entire training set. 
This has a few major benefits: (1) GPUs never have to wait for shared storage to return data before they can compute. Everything they need is on the local SSDs. (2) When a GPU node fails, its input data can be recovered from a surviving GPU node over the backend InfiniBand. After training starts, input data never has to be read from shared storage again. (3) It's common to scale up training over time by adding more GPUs (more data-parallel domains) to the job as it stabilizes. When this happens, I/O performance scales linearly because these new GPUs never have to fight over shared storage. A reasonable critique of this approach is that data management becomes more complicated; either the training framework has to keep track of which SSDs and nodes have copies of which input data, or a distributed, client-side shared namespace like WEKA Converged Mode or CoreWeave LOTA has to sit between your application and your data. In practice though, frontier models are trained for exactly one epoch; that is, every input token is processed exactly one time to achieve optimal model quality. Because no two GPUs will ever need to read the same input token, there's never a need to copy input tokens between nodes inside the training loop. I also acknowledge that the above description is greatly simplified; the entire node-local SSD capacity cannot be filled with input data, as space is also needed for checkpoints and other temporary data. However, the fact remains that super high-bandwidth or super high-capacity parallel file systems are not necessary for loading input tokens during training. AI training clusters are built with a ton of local SSDs to do the heavy lifting, and the input data for LLMs is small enough to fit in just a handful of GPU nodes. Writing model checkpoints: Though the read workload of LLM training is modest at best, the write workload can be quite intense at scale because the probability of failure increases superlinearly with the size of the training job. 
However, unlike with scientific HPC jobs, the checkpoint size does not scale as a function of the job size; the checkpoint for a 405 billion-parameter model trained on 16,000 GPUs is the same size as the checkpoint for that model trained on three nodes. This is a result of the fact that every training step is followed by a global synchronization which makes each data-parallel copy of the model identical. Only one copy of those model weights, which amounts to under a hundred terabytes for state-of-the-art LLMs, needs to be saved. Kartik and Colleen Tartow at VAST wrote a quantitative breakdown of the true I/O requirements of checkpointing, and they illustrate how even a trillion-parameter model can achieve 99.7% forward progress (only 0.3% time spent checkpointing) when training across 3,072 GPUs with a modest 273 GB/s file system. A parallel file system is not required to get that level of performance; for example, HDD-based Azure Blob achieved over 1 TB/s when benchmarked with IOR for writes at scale. As with reading input tokens though, the real goal for checkpointing at scale is to remove any dependence on shared storage from the training loop entirely. And again, the best way to do this is to simply checkpoint to node-local storage. However, special care must be taken to ensure that the checkpoints don't get lost when a node crashes. In practice, LLM training is now done with asynchronous, multilevel checkpointing. This technique provides the scalability of checkpointing to node-local storage and the durability of shared storage. The key to this checkpointing process is hierarchical data synchronization: (1) Model weights are first copied from GPU memory into the node's CPU memory after every training step. This checkpoint is governed by the CPU-GPU bandwidth (either PCIe or NVLink/Infinity Fabric), and a 500 GB checkpoint can complete in a second. The benefit of checkpointing to DRAM is that the GPU can unblock and begin computing the next step very quickly. 
However, this checkpoint in DRAM is not protected and will be lost if the node crashes. (2) To protect against node crashes, the checkpoint is then asynchronously copied from CPU DRAM to a neighbor node's local SSD using RDMA. Now if a node crashes, it can restore from a checkpoint that is stored on its neighboring node's SSD via InfiniBand. Reading and writing a 500 GB checkpoint to neighboring SSDs might take ten seconds, so this asynchronous replication might be done for every tenth DRAM checkpoint. (3) To store many checkpoints long-term, checkpoints are also asynchronously copied from node-local SSD to shared storage. This might take a minute or two per 500 GB checkpoint, so this last-level checkpoint copy might be done once every ten minutes. This hierarchical checkpointing scheme allows the GPUs to spend only a second checkpointing while being able to recover from job, node, and even cluster-level failures by tailoring the checkpoint tiering frequencies to the performance of each storage tier being used. The cost of recovering from a catastrophic failure might be re-computing up to ten minutes' worth of training, but given the rarity of such events, this scheme balances the performance (and risks) of checkpointing to DRAM against the low price (and low performance) of a durable, hard drive-based object store. To this latter point, the requirements of the shared storage system at the bottom of this checkpointing hierarchy are very modest: The checkpoint only needs to complete in the time between successive last-level checkpoint copies. If the 500 GB checkpoint is drained to shared storage only once every ten minutes, our shared storage only needs to deliver about 1 GB/s of total bandwidth. The write pattern from node-local NVMe to shared storage is arbitrary, because it is a simple copy operation of a fully formed checkpoint file. 
Unlike direct-to-storage checkpoints, there are no weirdly shaped tensors being serialized into a file on the fly; rather, opaque bits are streaming from a local checkpoint file into a remote object using whatever transfer size and parallelism gives the highest write bandwidth. This combination of modest write bandwidth and simple, sequential, large-block writes is ideally suited for object stores. This isn't to say a parallel file system cannot work here, but this checkpointing scheme does not benefit from directory structure, fine-grained consistency semantics, or any of the other complexities that drive up the cost of parallel file systems. The catch, of course, is that checkpointing using these schemes can be complicated to implement. Fortunately, a growing number of training frameworks support both writing and restoring checkpoints using asynchronous and hierarchical approaches. Model developers never have to worry about interacting with specific files or objects; instead, the framework manages data locality during checkpoint and restart underneath a high-level API. Model deployment and inferencing: Once a model is trained, putting it into production as an inferencing service is the final step of its lifecycle. From a storage and I/O standpoint, this is a lot more complicated than training because it marries an enterprise service delivery model (failover, load balancing, authentication, and scaling) with copies of a trained model running across HPC infrastructure. 
When you hear vendors talking about key-value stores, vector databases, and RAG, that is all happening at this stage. Setting aside everything but the storage attached to the GPU cluster though, the I/O requirements of inferencing are relatively straightforward: (1) When provisioning a GPU node for inferencing, model weights must be loaded from shared storage as fast as possible. (2) When using an LLM to search documents, a vector database is required to perform the similarity search that augments the LLM query with the relevant documents. This is the basis for RAG. (3) Key-value caches are often used to reduce the latency for different parts of the inferencing pipeline by storing context including the conversation or frequently accessed contextual documents. (4) As the inferencing demand evolves, different models and weights may be swapped in and out of individual GPU servers. A parallel file system is not particularly useful for any of these; the only place in which their high bandwidth would be a benefit is in loading and re-loading model weights (#1 and #4). But as with hierarchical checkpointing, those I/O operations are whole-object, read-only copies that are a natural fit for object APIs. Complex directory structures and strong consistency simply aren't necessary here. Objects are good enough, maybe better: None of the steps in this model training lifecycle uniquely benefit from the capabilities that parallel file systems offer: Data ingestion involves hundreds of petabytes of small documents, but they are immediately packaged and indexed into large data containers. Their metadata is stored in a separate key-value store, so the directory hierarchy of a file system isn't used, and once data has been packaged and indexed, it's never modified in-place. The bandwidth requirements are modest as well since web crawling is the rate-limiting step. Data processing is an I/O-intensive data analytics workload. 
Read bandwidth is critical here, but data is accessed in large transactions and most of the computation is embarrassingly parallel. This workload runs on standalone analytics clusters, so even though the read bandwidth here is rate-limiting, slower storage is not going to impact GPU utilization on training clusters in any way. This step also reduces data by 100x or more, so the write requirements are also modest. Training requires both loading input tokens and checkpointing model weights. However, both of these workloads lean on node-local NVMe in every node to eliminate slowdowns due to noisy neighbors. Input data is staged to node-local storage only once at the beginning of a training campaign, and checkpoints are asynchronously bled out to shared storage without impacting GPU utilization. Inferencing involves infrequent, read-only, bulk loading of model weights into GPU nodes. While key-value caches and vector databases are also used in inferencing, parallel file systems offer no particular benefit for them. The I/O patterns of each of these steps map nicely to object storage since they are predominantly write-once and whole-file transactions. Parallel file systems certainly can be used, and workloads will benefit from the high bandwidth they offer. However, they come with the cost of features that aren't necessary--either literal costs (in the case of appliances or proprietary software) or figurative costs (allocating people to manage the complexities of debugging a parallel file system). The importance of this latter point is hard to appreciate if you've never used a supercomputer without a parallel file system. However, I recently sat in on the validation of a brand-new H200 training cluster where various InfiniBand congestion and routing issues were being worked out. It wasn't until someone said \"eviction\" in some nontechnical context that I realized that the sporadic file system evictions during fabric instability were simply a non-issue. 
There was no cleanup of mount points after major fabric events because there was no persistent, fragile client-server state being maintained. I/Os between GPU nodes or nodes and storage might have failed during a rough patch, but they recovered and resumed on their own as soon as the fabric came back. Similarly, identity didn't matter, and all tests could be run as root because there was no implicit trust between the client kernel and remote storage. Removing the dependence between compute nodes, LDAP, and healthy file system mounts completely eliminates many of the challenges of standing up new clusters quickly. An ideal AI training cluster architecture: The workloads I described above form a rough outline for an AI training infrastructure which has: A bunch of GPU nodes with a strong RDMA backend like InfiniBand. Each node should have at least enough node-local SSD to store a substantial amount of the input tokens to be used for training, enough space for hierarchical checkpointing, and enough I/O bandwidth to these SSDs to support draining checkpoints from partner nodes' DRAM in just a few seconds. A separate frontend network that connects to storage is also a good idea; it ensures that asynchronous checkpoint draining won't interfere with weight synchronization in the training loop. A separate CPU cluster for data processing pipelines. A strong backend network will benefit the deduplication step (which is critical to producing high-quality training datasets), but more emphasis should be placed on optimizing large-transaction reads from storage. Given that CPU nodes are so much cheaper than GPU nodes, separating the data processing nodes from training nodes allows you to cut more corners when optimizing this CPU cluster. 
Keeping data processing out-of-band of actual model training means your most data-intensive step (data processing) is decoupled from your most expensive step (training). A scalable object store that supports basic write-once semantics with modest I/O bandwidth at scale. This matches the needs of the workloads with the price-performance of the storage system and simplifies the recovery process if the interconnect between compute and storage gets congested. It can also serve the data needs of all stages of the training pipeline: hundreds of petabytes of raw training data, hundreds of terabytes of input tokens, and tens of terabytes of model weights all have similar performance needs and can be stored on the same infrastructure with the appropriate QOS settings. A pool of general-purpose compute infrastructure for hosting the raw training data indices. This can also be used to support vector databases, raw context documents for RAG, and any other ancillary services required for production inferencing. By eschewing a high-performance parallel file system and localizing I/O performance to inside the GPU cluster with node-local NVMe, a vanilla network between the GPU cluster and the other subsystems is sufficient. Although less high-performance, these non-critical bits (ideally) have lower complexity and lower maintenance and support burdens as well, allowing (again, ideally) more resources to be sloshed towards supporting the high-value GPU infrastructure. Incidentally, this architecture happens to be how most of the largest AI training clusters on which I work are designed. But parallel files aren't all bad: Of course, having no parallel file system presents some usability challenges if users are expecting to be able to SSH into a login node and have a complete user environment ready. 
The user experience for the above infrastructure works best for those who are comfortable developing software in containers and launching pods rather than developing software in vim and submitting Slurm jobs. I do not advocate for throwing out parallel file systems if they're already ingrained in users' workflows! In addition, the latest crop of modern, distributed file systems all now support multi-protocol data access. For example, WEKA, VAST, and Qumulo all support S3 (object) interfaces as first-class citizens. Users who want the traditional HPC experience can play with their data using a file mount as they always have, while those who are coming in from the cloud-native side have equal access to those same data as objects. Supporting multiprotocol access to data in AI environments doesn't reduce the need to overbuild infrastructure or support stateful file mounts across all compute nodes, but it does provide an onramp for users to get comfortable moving away from the traditional HPC user experience. Finally, a few of the leading-edge parallel-file-system-turned-AI-storage platforms are also shipping features that make them valuable for the deployment and inferencing part of the lifecycle. For example, WEKA has their WARRP reference architecture for RAG, and VAST has its InsightEngine--both use the unique architectures underneath their file interfaces to accelerate vector queries far beyond what you would get from running a vector database on, say, Lustre. These so-called \"AI data platforms,\" despite starting as parallel file systems, are spreading their relevance out to the entire LLM lifecycle, filling needs for file, object, and structured data with a single storage system. This is all to say that parallel file systems aren't bad, and they aren't going anywhere. But they aren't required to train frontier models either, and as I've tried to describe above, some of the largest supercomputers on the planet are designed not to require them.",
            "content_html": "<p>The illustrious Jeff Denworth recently posted a hot take across social media, claiming that training large language models (LLMs) doesn't require massive, expensive parallel file systems:</p><p></p><p><br /></p><p>As someone who's been working on <a href=\"https://glennklockwood.com/garden/systems/Eagle\">one of the largest supercomputers on the planet</a>--one that has no parallel file system at all--I was surprised by how many incredulous or curious responses followed. I guess supercomputers and parallel file systems are like peas and carrots in so many people's minds that the idea of being able to run a massive parallel compute job without a massive parallel file system is so unintuitive that it is unbelievable.</p><p>I've given talks about how LLM training uses storage in the past, but I realized I've never written it down. So, for the benefit of humankind, let's talk about how these supercomputers without parallel file systems work.<span></span></p><p></p><div class=\"separator\" style=\"clear: both; display: none; text-align: center;\"></div><h2 style=\"text-align: left;\">The workload</h2><p>Though the actual model training on giant GPU supercomputers gets all the attention, the full process of training an LLM is a little more involved. A colleague of mine at Microsoft gave <a href=\"https://www.sniadeveloper.org/events/agenda/session/670\">a great overview of this storage-centric, end-to-end picture at SNIA SDC24</a>; broadly, training an LLM involves the following steps:</p><p></p><ol style=\"text-align: left;\"><li><b>Data ingestion</b>: This is where crawlers scrape the Internet and pull down raw html, images, videos, and other media. These raw data are indexed and shoved into a data warehouse. 
At scale, this can be hundreds or thousands of petabytes of data for <a href=\"https://glennklockwood.com/garden/frontier-model\">frontier models</a>.</li><li><b>Data preparation</b>: This is where the raw data is converted into tokenized data. It amounts to a huge data analytics problem that uses well-documented text and image processing pipelines that filter, deduplicate, and otherwise clean the raw garbage on the Internet using frameworks like Apache Spark. The hundreds of petabytes of input get reduced down by 10x-1000x.</li><li><b>Model training</b>: This is where the tokenized data is shoveled through the LLM on giant GPU clusters in little batches. As the data is processed, the model weights are updated, and those weights are checkpointed to storage. If a compute node crashes and the job fails, that checkpoint is used to restart, just like a traditional scientific HPC application. There might be fine-tuning and the like happening as part of this too, but I won't talk about that.</li><li><b>Model deployment and inferencing</b>: This is where the final model is copied across giant fields of inferencing servers, and a web service sits in front of it all to transform REST API requests into actual inferencing queries that run on the GPUs. This isn't training, but we'll talk about it anyway.</li></ol><p style=\"text-align: left;\">To understand why a parallel file system offers no particular benefit to any of these steps, let's take a closer look at what's going on in each one.</p><h3 style=\"text-align: left;\">Data ingestion</h3><p style=\"text-align: left;\">Data ingestion is a widely distributed process that involves minimal computation; you just need a lot of Internet-facing network connectivity and CPU cores to drive independent processes connecting to other people's public HTTP servers. 
I don't know a lot about what this process looks like, because it never relies on anything resembling a supercomputer.</p><p style=\"text-align: left;\">To the best of my knowledge, data ingestion just pulls HTML, images, or video streams from the Internet and packs them into <i>data containers</i>. As it is packing webpages into these files, it is building a separate <i>index</i> that stores metadata about the webpage (URL, encoding, date of access) and its location (the file in which the webpage's contents are stored and the byte offset within that file). Thousands of VMs might be performing these tasks completely independently, and because they do not need to synchronize with each other at any step, it can be better to distribute these scrapers around the world rather than centralize all of them in a single datacenter.</p><p style=\"text-align: left;\">While one <i>could</i> store each scraped HTML page in a file that's organized in a parallel file system, accessing those files would be very slow--a full crawl of all the data would require scanning hundreds of billions of little files. So instead of implementing <i>data containers</i> using files and the <i>index</i> using a file system directory tree, it's better to implement data containers on top of object stores and use a distributed key-value store for the index. The fact that scraped data is write-once (and therefore doesn't need features like file locking or read-modify-write), is a natural fit for object stores' design around object immutability.</p><h3 style=\"text-align: left;\">Data preparation</h3><p style=\"text-align: left;\">Once raw data is indexed and saved in object stores, the first phase of computation comes into play. 
I've documented this data processing pipeline on my <a href=\"https://glennklockwood.com/garden/LLM-training-datasets#computational-requirements\">LLM training datasets page</a>, but a lot of it amounts to running Apache Spark-like pipelines that chew through all the raw data in a trivially parallel way.</p><p style=\"text-align: left;\">These data processing pipelines are very well defined from the days when Hadoop was all the rage, and their data access patterns map well to the strengths of object stores. Each processing task might read a couple hundred megabytes of data from an object all at once, process it in-memory, then dump it back out to objects all at once. File systems offer no benefit here, because each task reads once and writes once rather than skipping around inside individual objects.</p><p style=\"text-align: left;\">There is a significant compute workload here, and there are points in the data processing pipeline where global synchronization of all tasks is required. Specifically, the process of deduplicating input data--which is <a href=\"https://arxiv.org/abs/2107.06499\">a critical step to getting a high-quality model these days</a>--requires comparing every piece of data to every other piece of data. As a result, this data preparation phase is often done in a centralized location that is adjacent to the object store containing all the raw data scraped from the previous step. The clusters used for data processing can resemble traditional CPU-based supercomputers (think a system like <a href=\"https://tacc.utexas.edu/systems/frontera/\">TACC's Frontera</a>), and in some cases, they might even have full RDMA fabrics to accelerate the all-to-all deduplication step.</p><p style=\"text-align: left;\">Critically, this data processing step is not done on the GPU nodes that actually train the model. Data processing is usually limited by I/O bandwidth to storage, and you never want your GPUs stalling out because they're waiting for data. 
Parallel file system vendors might tell you that the only way to avoid this GPU starvation issue is to plug every GPU node into a super-fast parallel file system, but the reality is that people just do this I/O-heavy step on completely separate supercomputers before training on GPUs ever begins.</p><p style=\"text-align: left;\">CPU nodes are significantly cheaper than GPUs, so buying cheap object storage and a cheap CPU cluster is more cost-effective than buying an expensive file system and wasting your GPU nodes on trivially parallel text processing tasks. To illustrate this, consider some normalized list prices from Azure:</p><p style=\"text-align: left;\"></p><ul style=\"text-align: left;\"><li>$1.00 gets you a 96-core general-purpose VM with 384 GB of RAM</li><li>$1.65 gets you a 176-core HPC-optimized VM with NDR InfiniBand and 768 GB of RAM</li><li>$22.55 gets you a 96-core, 8x H100 GPU VM with NDR InfiniBand</li></ul><div>Given that GPUs don't give you a 13x-22x speedup for data processing despite costing 13x-22x as much, it makes no sense to perform this data processing on GPU nodes inline with training.</div><p></p><p style=\"text-align: left;\">One could argue that the GPUs are sitting idle while the data processing cluster is working anyway, but rest assured that AI model shops have no shortage of work to keep their GPUs busy. Data processing for the next model on a CPU cluster often happens at the same time the current model is being trained on the GPU cluster. 
In cases where there isn't enough work to keep both CPU and GPU clusters busy around the clock, also remember that most of this stuff happens in the cloud, and cloud providers can sell those idle CPU or GPU cycles to another customer in between training campaigns.</p><h3 style=\"text-align: left;\">Model training</h3><p style=\"text-align: left;\">Huge, distributed training jobs are where most people would think a fast parallel file system is required for both reading input data and writing out checkpoints. After all, the need for fast checkpointing and restart was the primary driver behind the creation of parallel file systems.</p><p style=\"text-align: left;\">While parallel file systems certainly <i>can</i> be used for training, they are not the most cost-effective or scalable way to train across tens of thousands of GPUs. To better illustrate the reasons why this is, let's consider the processes of reading inputs and writing checkpoints separately.</p><h4 style=\"text-align: left;\">Reading training data</h4><p style=\"text-align: left;\">Training a model on GPUs, whether it be on one or a thousand nodes, follows a simple cycle (this is a \"step\" in LLM training parlance) that's repeated over and over:</p><p style=\"text-align: left;\"></p><ol style=\"text-align: left;\"><li>A batch of tokenized data is loaded into GPU memory</li><li>That data is then processed through the neural network and the model weights are adjusted</li><li>All GPUs synchronize their updated weights</li></ol><p style=\"text-align: left;\">It's tempting to imagine the I/O load generated by step #1 as being the same as it would be for a traditional HPC job: data is read from a parallel file system into compute memory at the start of every single step:</p><p></p><div class=\"separator\" style=\"clear: both; text-align: center;\"></div><p style=\"text-align: left;\">In years past, storage vendors would've insisted that this repeated, random re-reading of input data at every step requires a 
super-fast parallel file system to keep up. However, two factors make that untrue:</p><p style=\"text-align: left;\"></p><ol style=\"text-align: left;\"><li>The input data isn't millions of little text or image files. As described in the data ingest and data processing steps, these small files are packaged into large objects before the GPUs ever see them.</li><li>Tokenized data is very dense compared to raw input, so the amount of bytes being read over the course of hundreds or thousands of steps is actually quite small.</li></ol><p></p><p style=\"text-align: left;\">To quantify #2, consider the <a href=\"https://arxiv.org/abs/2407.21783\">Llama-3 405b model</a>, which was trained on a significant fraction of the public Internet--15.6 <i>trillion</i> tokens. That sounds like a lot of information until you realize that <a href=\"https://glennklockwood.com/garden/LLM-training-datasets#tokenized-data\">the size of a typical token is between 3 and 5 bytes</a> depending on the tokenizer and encoding. This means that the entire 405-billion parameter Llama-3 model, which was trained using 16,000 GPUs, only had to load 60 TB of tokens from storage. That divides out to 3.75 GB of tokens processed by each GPU over the entire course of a 54-day run.</p><p style=\"text-align: left;\">When you consider how few bytes are required to train an LLM, it should become clear that the biggest I/O challenge in the performance-critical training loop isn't raw bandwidth; it's performance variability. As such, the best way to ensure that GPUs do not stall out due to read requests is to eliminate as much I/O performance variability as possible. To do this, you have to minimize the sources of contention that might arise between the storage devices and the network that connects them to the GPUs. 
While you <i>can</i> do this using sophisticated quality-of-service in both the storage servers and interconnect, there is an easier way.</p><div class=\"separator\" style=\"clear: both; text-align: center;\"></div><p style=\"text-align: left;\">Just stick some local SSDs in every GPU node.</p><p style=\"text-align: left;\">This ensures that no contention will occur when loading data from storage into the GPU, because the only network between them is the PCIe on the node. In addition, using node-local NVMe allows storage capacity and storage performance to scale linearly with GPU performance. By comparison, a remote storage system (whether it be parallel file or object) won't get any bigger or faster as you add more GPUs to the training job, resulting in each GPU losing efficiency due to I/O as more GPUs are added to the training job.</p><p style=\"text-align: left;\">In practice, model training uses local SSDs like this:</p><div class=\"separator\" style=\"clear: both; text-align: center;\"></div><p style=\"text-align: left;\">At the start of a training job, data is read from remote storage into the local SSDs in a distributed fashion <i>once</i>. Because the tokenized data is so small, many replicas of the entire dataset can be stored across the job's GPU nodes as well; for example, if you were to train Llama-3 405b on NVIDIA DGX H100 nodes, <b>you could fit the entire training dataset (all 60 TB of it) on just three nodes</b> since each node comes with 30 TB of local SSD. Given that the model was trained on 16,000 GPUs (2,000 nodes), that translates to storing hundreds of replicas of the entire training set. This has a few major benefits:</p><p style=\"text-align: left;\"></p><ol style=\"text-align: left;\"><li>GPUs never have to wait for shared storage to return data before they can compute. Everything they need is on the local SSDs.</li><li>When a GPU node fails, its input data can be recovered from a surviving GPU node over the backend InfiniBand. 
After training starts, input data never has to be read from shared storage again.</li><li>It's common to scale up training over time by adding more GPUs (more data-parallel domains) to the job as it stabilizes. When this happens, I/O performance scales linearly because these new GPUs never have to fight over shared storage.</li></ol><p></p><p style=\"text-align: left;\">A reasonable critique of this approach is that data management becomes more complicated; either the training framework has to keep track of which SSDs and nodes have copies of which input data, or a distributed, client-side shared namespace like <a href=\"https://www.weka.io/resources/solution-brief/weka-data-platform-converged-mode/\">WEKA Converged Mode</a> or <a href=\"https://docs.coreweave.com/docs/products/storage/object-storage/concepts/lota\">CoreWeave LOTA</a> has to sit between your application and your data. In practice though, frontier models are trained for exactly one epoch; that is, <a href=\"https://glennklockwood.com/garden/scaling-laws#applying-scaling-laws\">every input token is processed exactly one time to achieve optimal model quality</a>. Because no two GPUs will ever need to read the same input token, there's never a need to copy input tokens between nodes inside the training loop. </p><p style=\"text-align: left;\">I also acknowledge that the above description is greatly simplified; the entire node-local SSD capacity cannot be filled with input data, as space is also needed for checkpoints and other temporary data. However, the fact remains that super high-bandwidth or super high-capacity parallel file systems are not necessary for loading input tokens during training. 
AI training clusters are built with a ton of local SSDs to do the heavy lifting, and the input data for LLMs is small enough to fit in just a handful of GPU nodes.</p><h4 style=\"text-align: left;\">Writing model checkpoints</h4><p style=\"text-align: left;\">Though the read workload of LLM training is modest at best, the write workload can be quite intense at scale because the probability of failure increases superlinearly with the size of the training job. However, unlike with scientific HPC jobs, <b>the checkpoint size does not scale as a function of the job size</b>; the checkpoint for a 405 billion-parameter model trained on 16,000 GPUs is the same size as the checkpoint for that model trained on three nodes. This is a result of the fact that every training step is followed by a global synchronization which makes each data-parallel copy of the model identical. Only one copy of those model weights, which amounts to under a hundred terabytes for state-of-the-art LLMs, needs to be saved:</p><div class=\"separator\" style=\"clear: both; text-align: center;\"><span style=\"text-align: left;\"> </span></div><p style=\"text-align: left;\">Kartik and Colleen Tartow at VAST wrote <a href=\"https://www.vastdata.com/blog/a-checkpoint-on-checkpoints-in-llms\">a quantitative breakdown of the true I/O requirements of checkpointing</a>, and they illustrate how even a trillion-parameter model can achieve 99.7% forward progress (only 0.3% time spent checkpointing) when training across 3,072 GPUs with a modest 273 GB/s file system. A parallel file system is not required to get that level of performance; for example, HDD-based <a href=\"https://x.com/glennklockwood/status/1795548752628867132\">Azure Blob achieved over 1 TB/s when benchmarked with IOR</a> for writes at scale.</p><p style=\"text-align: left;\">As with reading input tokens though, the real goal for checkpointing at scale is to remove any dependence on shared storage from the training loop entirely. 
And again, the best way to do this is to simply checkpoint to node-local storage. However, special care must be taken to ensure that the checkpoints don't get lost when a node crashes.</p><p style=\"text-align: left;\">In practice, LLM training is now done with asynchronous, multilevel checkpointing. This technique provides the scalability of checkpointing to node-local storage and the durability of shared storage:</p><div class=\"separator\" style=\"clear: both; text-align: center;\"></div><p style=\"text-align: left;\">The key to this checkpointing process is hierarchical data synchronization:</p><p style=\"text-align: left;\"></p><ol style=\"text-align: left;\"><li><b>Model weights are first copied from GPU memory into the node's CPU memory</b> after every training step. This checkpoint is governed by the CPU-GPU bandwidth (either PCIe or NVLink/Infinity Fabric), and a 500 GB checkpoint can complete in a second. The benefit of checkpointing to DRAM is that the GPU can unblock and begin computing the next step very quickly. However, this checkpoint in DRAM is not protected and will be lost if the node crashes.</li><li>To protect against node crashes, the <b>checkpoint is then asynchronously copied from CPU DRAM to a neighbor node's local SSD</b> using RDMA. Now if a node crashes, it can restore from a checkpoint that is stored on its neighboring node's SSD via InfiniBand. Reading and writing a 500 GB checkpoint to neighboring SSDs might take ten seconds, so this asynchronous replication might be done for every tenth DRAM checkpoint.</li><li>To store many checkpoints long-term, <b>checkpoints are also asynchronously copied from node-local SSD to shared storage</b>. 
This might take a minute or two per 500 GB checkpoint, so this last-level checkpoint copy might be done once every ten minutes.</li></ol><p style=\"text-align: left;\">This hierarchical checkpointing scheme allows the GPUs to spend only a second checkpointing while being able to recover from job, node, and even cluster-level failures by tailoring the checkpoint tiering frequencies to the performance of each storage tier being used. The cost of recovering from a catastrophic failure might be re-computing up to ten minutes worth of training, but given the rarity of such events, this scheme balances the performance (and risks) of checkpointing to DRAM against hard drive prices (and suffering their performance) for a durable object store.</p><p style=\"text-align: left;\">To this latter point, the requirements of the shared storage system at the bottom of this checkpointing hierarchy are very modest:</p><p style=\"text-align: left;\"></p><ul style=\"text-align: left;\"><li>The checkpoint only needs to complete in the time between successive last-level checkpoint copies. If the 500 GB checkpoint is drained to shared storage only once every ten minutes, our shared storage only needs to deliver 1 GB/s of total bandwidth.</li><li>The write pattern from node-local NVMe to shared storage is arbitrary, because it is a simple copy operation of a fully formed checkpoint file. Unlike direct-to-storage checkpoints, there are no weirdly shaped tensors being serialized into a file on the fly; rather, opaque bits are streaming from a local checkpoint file into a remote object using whatever transfer size and parallelism gives the highest write bandwidth.</li></ul><p>This combination of modest write bandwidth and simple, sequential, large-block writes is ideally suited for object stores. 
This isn't to say a parallel file system cannot work here, but this checkpointing scheme does not benefit from directory structure, fine-grained consistency semantics, or any of the other complexities that drive up the cost of parallel file systems.</p><p>The catch, of course, is that checkpointing using these schemes can be complicated to implement. Fortunately, a <a href=\"https://www.linkedin.com/posts/jeffreydenworth_reducing-model-checkpointing-times-by-over-activity-7289273345269800960-zBj7\">growing number of training frameworks</a> support both writing and restoring checkpoints using asynchronous and hierarchical approaches. Model developers never have to worry about interacting with specific files or objects; instead, the framework manages data locality during checkpoint and restart underneath a high-level API.</p><h3 style=\"text-align: left;\">Model deployment and inferencing</h3><p style=\"text-align: left;\">Once a model is trained, putting it into production as an inferencing service is the final step of its lifecycle. From a storage and I/O standpoint, this is a lot more complicated than training because it marries an enterprise service delivery model (failover, load balancing, authentication, and scaling) with copies of a trained model running across HPC infrastructure. When you hear vendors talking about key-value stores, vector databases, and RAG, that is all happening at this stage.</p><p style=\"text-align: left;\">Setting aside everything but the storage attached to the GPU cluster though, the I/O requirements of inferencing are relatively straightforward:</p><p style=\"text-align: left;\"></p><ol style=\"text-align: left;\"><li>When provisioning a GPU node for inferencing, model weights must be loaded from shared storage as fast as possible.</li><li>When using an LLM to search documents, a vector database is required to perform the similarity search that augments the LLM query with the relevant documents. 
This is the basis for RAG.</li><li>Key-value caches are often used to reduce the latency for different parts of the inferencing pipeline by storing context including the conversation or frequently accessed contextual documents.</li><li>As the inferencing demand evolves, different models and weights may be swapped in and out of individual GPU servers.</li></ol><p style=\"text-align: left;\">A parallel file system is not particularly useful for any of these; the only place in which their high bandwidth would be a benefit is in loading and re-loading model weights (#1 and #4). But as with hierarchical checkpointing, those I/O operations are whole-object, read-only copies that are a natural fit for object APIs. Complex directory structures and strong consistency simply aren't necessary here.</p><h2 style=\"text-align: left;\">Objects are good enough, maybe better</h2><p style=\"text-align: left;\">None of the steps in this model training lifecycle uniquely benefit from the capabilities that parallel file systems offer:</p><p style=\"text-align: left;\"></p><ul style=\"text-align: left;\"><li>Data ingestion involves hundreds of petabytes of small documents, but they are immediately packaged and indexed into large data containers. Their metadata is stored in a separate key-value store, so the directory hierarchy of a file system isn't used, and once data has been packaged and indexed, it's never modified in-place. The bandwidth requirements are modest as well since web crawling is the rate-limiting step.</li><li>Data processing is an I/O-intensive data analytics workload. Read bandwidth is critical here, but data is accessed in large transactions and most of the computation is embarrassingly parallel. This workload runs on standalone analytics clusters, so even though the read bandwidth here is rate-limiting, slower storage is not going to impact GPU utilization on training clusters in any way. 
This step also reduces data by 100x or more, so the write requirements are also modest.</li><li>Training requires both loading input tokens and checkpointing model weights. However, both of these workloads lean on node-local NVMe in every node to eliminate slowdowns due to noisy neighbors. Input data is staged to node-local storage only once at the beginning of a training campaign, and checkpoints are asynchronously bled out to shared storage without impacting GPU utilization.</li><li>Inferencing involves infrequent, read-only, bulk loading of model weights into GPU nodes. While key-value caches and vector databases are also used in inferencing, parallel file systems offer no particular benefit for them.</li></ul><p style=\"text-align: left;\">The I/O patterns of each of these steps map nicely to object storage since they are predominantly write-once and whole-file transactions. Parallel file systems certainly can be used, and workloads will benefit from the high bandwidth they offer. However, they come with the cost of features that aren't necessary--either literal costs (in the case of appliances or proprietary software) or figurative costs (allocating people to manage the complexities of debugging a parallel file system).</p><p style=\"text-align: left;\">The importance of this latter point is hard to appreciate if you've never used a supercomputer without a parallel file system. However, I recently sat in on the validation of <a href=\"https://www.top500.org/system/180349/\">a brand-new H200 training cluster</a> where various InfiniBand congestion and routing issues were being worked out. It wasn't until someone said \"eviction\" in some nontechnical context that I realized that the sporadic file system evictions during fabric instability were simply a non-issue. There was no cleanup of mount points after major fabric events because there was no persistent, fragile client-server state being maintained. 
I/Os between GPU nodes or nodes and storage might have failed during a rough patch, but they recovered and resumed on their own as soon as the fabric came back. Similarly, identity didn't matter, and all tests could be run as root because there was no implicit trust between the client kernel and remote storage. Removing the dependence between compute nodes, LDAP, and healthy file system mounts completely eliminates many of the challenges of standing up new clusters quickly.</p><h3 style=\"text-align: left;\">An ideal AI training cluster architecture</h3><p style=\"text-align: left;\">The workloads I described above form a rough outline for an AI training infrastructure which has:</p><p style=\"text-align: left;\"></p><ol style=\"text-align: left;\"><li><b>A bunch of GPU nodes with a strong RDMA backend like InfiniBand</b>. Each node should have at least enough node-local SSD to store a substantial amount of the input tokens to be used for training, enough space for hierarchical checkpointing, and enough I/O bandwidth to these SSDs to support draining checkpoints from partner nodes' DRAM in just a few seconds. A separate frontend network that connects to storage is also a good idea; it ensures that asynchronous checkpoint draining won't interfere with weight synchronization in the training loop.</li><li><b>A separate CPU cluster for data processing pipelines</b>. A strong backend network will benefit the deduplication step (which is critical to producing high-quality training datasets), but more emphasis should be placed on optimizing large-transaction reads from storage. Given that CPU nodes are so much cheaper than GPU nodes, separating the data processing nodes from training nodes allows you to cut more corners when optimizing this CPU cluster. 
Keeping data processing out-of-band of actual model training means your most data-intensive step (data processing) is decoupled from your most expensive step (training).</li><li><b>A scalable object store that supports basic write-once semantics with modest I/O bandwidth at scale</b>. This matches the needs of the workloads with the price-performance of the storage system and simplifies the recovery process if the interconnect between compute and storage gets congested. It can also serve the data needs of all stages of the training pipeline: hundreds of petabytes of raw training data, hundreds of terabytes of input tokens, and tens of terabytes of model weights all have similar performance needs and can be stored on the same infrastructure with the appropriate QOS settings.</li><li><b>A pool of general-purpose compute infrastructure for hosting the raw training data indices</b>. This can also be used to support vector databases, raw context documents for RAG, and any other ancillary services required for production inferencing.</li></ol><p style=\"text-align: left;\">By eschewing a high-performance parallel file system and localizing I/O performance to inside the GPU cluster with node-local NVMe, a vanilla network between the GPU cluster and the other subsystems is sufficient. Although less high-performance, these non-critical bits (ideally) have lower complexity, maintenance, and supportability as well, allowing (again, ideally) more resources to be sloshed towards supporting the high-value GPU infrastructure.</p><p style=\"text-align: left;\">Incidentally, this architecture happens to be how most of the largest AI training clusters on which I work are designed.</p><h3 style=\"text-align: left;\">But parallel files aren't all bad</h3><p style=\"text-align: left;\">Of course, having no parallel file system presents some usability challenges if users are expecting to be able to SSH into a login node and have a complete user environment ready. 
The user experience for the above infrastructure works best for those who are comfortable developing software in containers and launching pods rather than developing software in vim and submitting Slurm jobs. <i>I do not advocate for throwing out parallel file systems if they're already ingrained in users' workflows!</i></p><p style=\"text-align: left;\">In addition, the latest crop of modern, distributed file systems all now support multi-protocol data access. For example, <a href=\"https://docs.weka.io/4.0/additional-protocols/s3\">WEKA</a>, <a href=\"https://support.vastdata.com/s/article/UUID-67c215f7-63a8-5d58-196e-5066199a6f60\">VAST</a>, and <a href=\"https://docs.qumulo.com/administrator-guide/s3-api/configuring-using-s3-api.html\">Qumulo</a> all support S3 (object) interfaces as first-class citizens. Users who want the traditional HPC experience can play with their data using a file mount as they always have, while those who are coming in from the cloud-native side have equal access to those same data as objects. Supporting multiprotocol access to data in AI environments doesn't reduce the need to overbuild infrastructure or support stateful file mounts across all compute nodes, but it does provide an onramp for users to get comfortable moving away from the traditional HPC user experience.</p><p style=\"text-align: left;\">Finally, a few of the leading-edge parallel-file-system-turned-AI-storage platforms are also shipping features that make them valuable for the deployment and inferencing part of the lifecycle. 
For example, WEKA has their <a href=\"https://www.weka.io/resources/reference-architecture/warrp-weka-ai-rag-reference-platform/\">WARRP reference architecture for RAG</a>, and <a href=\"https://www.vastdata.com/press-releases/vast-data-unveils-vast-insightengine-with-nvidia\">VAST has its InsightEngine</a>--both use the unique architectures underneath their file interfaces to accelerate vector queries far beyond what you would get from running a vector database on, say, Lustre. These so-called \"AI data platforms,\" despite starting as parallel file systems, are spreading their relevance out to the entire LLM lifecycle, filling needs for file, object, and structured data with a single storage system.</p><p style=\"text-align: left;\">This is all to say that parallel file systems aren't bad, and they aren't going anywhere. But they aren't required to train frontier models either, and as I've tried to describe above, some of the largest supercomputers on the planet are designed not to require them.</p><p></p><p></p><p></p><p></p>",
            "url": "https://hpc.social/personal-blog/2025/llm-training-without-a-parallel-file-system/",
            
            
            
            
            
            "date_published": "2025-02-02T03:59:00-07:00",
            "date_modified": "2025-02-02T03:59:00-07:00",
            
                "author": "Glenn K. Lockwood's Blog"
            
        },
    
        {
            "id": "https://hpc.social/personal-blog/2025/surfing-the-singularity-the-world-is-not-flat/",
            "title": "Surfing the Singularity - The World is Not Flat",
            "summary": null,
            "content_text": "As Bill Gates recalls in his recent book-bumping interview with the Wall Street Journal, in the early innocent days of Microsoft he and his co-founder Paul Allen didn't believe in having an office in Washington, D.C.[1] They were soon to learn that was a mistake.[2] Compare and contrast with the scene in the Capital Rotunda last week for the inauguration of the new populist administration - Microsoft, Amazon, Facebook, Apple, Google, TikTok, and of course Tesla, all represented by their CEOs.[3] Microsoft's market capitalization as of this writing is now greater than the GDP of France.[4] Elon Musk's personal wealth is on par with the GDP of Denmark. Meta's platforms reach an estimated 40% of the world's population. Apple ad from yesteryear. We're now long past 1984.Consider these other inconvenient truths about global technology: that NVIDIA does not make the GPUs it designs, that most are manufactured by TSMC in Taiwan, which is about as far away from China as Cuba is from Florida. Software talent is globally distributed, and prices vary widely. There are some very good schools in some of these relatively inexpensive places - in the 2024 edition of the ACM student programming contest, MIT placed the highest among US teams, in 11th place.[5] Software salaries 2023, by country, exchange rate normalized.[6]AI programs and quantum computing initiatives are increasingly becoming nationalized as a strategic imperative. It is assumed that the country which is first to Artificial General Intelligence (AGI) will be the first to be able to use it to take control of the world. This fear is fueling an arms race in AI architectures, and the chips and the energy stations which power them. But is this a rational fear? Unlearn What You Have LearnedDeepSeek, with its deja vu inducing name and similarly eerily similar whale icon, is a new AI chatbot model wholly owned out of Hangzhou, China. And its #1 on the Apple app store, with a bullet. 
And yes, it's quite up front that it tracks your data. There are several notable features and claims about this model: that it was trained at a fraction of the typical cost and time on a fraction of the typical hardware. That it performs about as well as the OpenAI model released last month, maybe not as well on some things like straight math, but perhaps better at general writing, maybe with a bit more \"personality\". It shows its work - how it arrived at the answer it's providing to the chat prompt. And, for the kicker from the totalitarian state, it's open source up on Hugging Face.[7] Shots fired. NVIDIA shares tumbled for a loss of $600B in one day in response, the largest single-day loss in history. But what I'd like to know is, if this is the open source model, what's the real one like? Recent Federal governments have tried to enforce policies limiting technology exports to China, but in truth China has long had its own development programs - it has not participated in standardized HPC metric sharing and supercomputer ranking since 2017.Red Flag on the TrackAlong with NVIDIA, power generation stocks were also down hard - GE Vernova down almost 20%. This does not mean the recent trend of tech companies buying nuclear power plants won't continue, or that we won't continue to hear of existing power stations giving data centers a direct hard wire bypassing the municipal grid. But clearly this open ended hunger for power - watts and GPU cycles - is not sustainable. And DeepSeek lays that bare.But make no mistake - this race is not over, it's just warming up. OpenAI with their new Operators product[8] currently defines AGI as a gaggle of collaborating AI agents, each with its own unique set of capabilities and goals. 
NVIDIA CEO Jensen Huang does his part in driving the GPU-dependent AI hype cycle by saying IT departments will become the new HR departments - for AI employees.[9] Goldman Sachs is telling clients to expect AI employees this year.[10] Cost avoidance will be a major driver.[11] And why not, when the same major technology companies report seemingly amazing results using AI for software development? Google saves 50% on code migration time with AI! It gives *me* FOMO! [12]The CEO of Anthropic predicts that by 2027 AI will be generally better than humans at almost everything.[13] Well, at some things maybe better than others. Turns out, what's the number one occupation we expect to be replaced by AI? Why, AI engineers and data scientists.Job skills impacted by generative AI, ranked. Maybe I should have been a plumber.[14]This is vast uncharted territory for companies, especially for those of size, wedded to their legacy political structures and being either un-nimble or worse, fragile. People are not machines. The mistakes AI agents make are not of the same kind made by humans - current generative AI models are intentionally designed to make stuff up, to not just say \"I don't know\".[15] As a manager, you'll get no benefit of human insights into the truth such as from body language, though you may get to understand the \"personality\" and general performance characteristics of your AI employees over time. That is, until the managers are replaced by AI. But until then, what does your management interface look like? Is it perhaps similar to the IDE for a senior software engineer who manages a team of AI coders? 
In the AI-laced future, there will still be a place for HCI/UX designers.Chef of the FutureFinally, the November 2024 report from the National Academies on the \"future of work\" says \"it's impossible to predict exactly the nature of the coming changes in AI and all their effects on the economy and society\".[16] This includes how it changes the nature of various jobs, or outright eliminates them. Continuing education will be key to a resilient workforce, and it turns out, AI might even play a role in that.As shown in Washington last week, there are new business trends being driven by a rejuvenated alliance between big technology and big government, and it is therefore time for astute technology and business leaders to pay attention to both.[17] For example, tomorrow (January 30, 2025) OpenAI is holding a previously scheduled closed door meeting in Washington regarding its own current agentic technology innovations, and what they potentially imply for the US people and its government. I imagine the topic of DeepSeek will now disrupt the meeting agenda, somewhat. At minimum, it's a topic for a future blog. Regards. - andyReferences[0] Photo by AJ Colores on Unsplash, https://unsplash.com/@ajcolores      [1] Bill Gates interview by the Wall Street Journal, January 2025, https://www.youtube.com/watch?v=4LL-ynK_exM[2] US vs. 
Microsoft, https://en.wikipedia.org/wiki/United_States_v._Microsoft_Corp[3] https://apnews.com/article/trump-inauguration-tech-billionaires-zuckerberg-musk-wealth-0896bfc3f50d941d62cebc3074267ecd[4] https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)[5] https://icpc.global/worldfinals/results [6] https://www.reddit.com/r/dataisbeautiful/comments/17a63yo/oc_2023_developer_compensation_by_country taken from the 2023 Stack Overflow developer survey.[7] DeepSeek at Hugging Face: https://huggingface.co/organizations/deepseek-ai/activity/all[8] OpenAI Operators: https://www.nytimes.com/2025/01/23/technology/openai-operator-launch.html [9] NVIDIA CEO Jensen Huang on IT as the new HR: https://www.aol.com/finance/nvidia-jensen-huang-says-become-133641793.html[10] Goldman Sachs on the rise of AI employees: https://it.slashdot.org/story/25/01/21/2213230/managing-ai-agents-as-employees-is-the-challenge-of-2025-says-goldman-sachs-cio[11] https://www.msn.com/en-us/money/markets/why-cost-avoidance-became-an-ai-buzzword-for-holding-down-headcount/ar-BB1rmJSx[12] https://developers.slashdot.org/story/25/01/17/2156235/google-reports-halving-code-migration-time-with-ai-help[13] https://arstechnica.com/ai/2025/01/anthropic-chief-says-ai-could-surpass-almost-all-humans-at-almost-everything-shortly-after-2027/[14] https://reports.weforum.org/docs/WEF_Future_of_Jobs_Report_2025.pdf[15] https://slashdot.org/story/25/01/23/1645242/ai-mistakes-are-very-different-from-human-mistakes[16] https://nap.nationalacademies.org/resource/27644/interactive/[17] https://hbr.org/2024/11/navigating-the-new-geopolitics-of-tech",
            "content_html": "<div class=\"separator\" style=\"clear: both; text-align: center;\"></div><p>As Bill Gates recalls in his recent book-bumping interview with the Wall Street Journal, in the early innocent days of Microsoft he and his co-founder Paul Allen didn't believe in having an office in Washington, D.C.[1] They were soon to learn that was a mistake.[2] Compare and contrast with the scene in the Capital Rotunda last week for the inauguration of the new populist administration - Microsoft, Amazon, Facebook, Apple, Google, TikTok, and of course Tesla, all represented by their CEOs.[3] Microsoft's market capitalization as of this writing is now greater than the GDP of France.[4] Elon Musk's personal wealth is on par with the GDP of Denmark. Meta's platforms reach an estimated 40% of the world's population. </p><table align=\"center\" cellpadding=\"0\" cellspacing=\"0\" class=\"tr-caption-container\" style=\"margin-left: auto; margin-right: auto;\"><tbody><tr><td style=\"text-align: center;\"></td></tr><tr><td class=\"tr-caption\" style=\"text-align: center;\">Apple ad from yesteryear. We're now long past 1984.</td></tr></tbody></table><p>Consider these other inconvenient truths about global technology: that NVIDIA does not make the GPUs it designs, that most are manufactured by TSMC in Taiwan, which is about as far away from China as Cuba is from Florida. Software talent is globally distributed, and prices vary widely. 
There are some very good schools in some of these relatively inexpensive places - in the 2024 edition of the ACM student programming contest, MIT placed the highest among US teams, in 11th place.[5] </p><table align=\"center\" cellpadding=\"0\" cellspacing=\"0\" class=\"tr-caption-container\" style=\"margin-left: auto; margin-right: auto;\"><tbody><tr><td style=\"text-align: center;\"></td></tr><tr><td class=\"tr-caption\" style=\"text-align: center;\">Software salaries 2023, by country, exchange rate normalized.[6]</td></tr></tbody></table><p>AI programs and quantum computing initiatives are increasingly becoming nationalized as a strategic imperative. It is assumed that the country which is first to Artificial General Intelligence (AGI) will be the first to be able to use it to take control of the world. This fear is fueling an arms race in AI architectures, and the chips and the energy stations which power them. But is this a rational fear? </p><h3 style=\"text-align: left;\">Unlearn What You Have Learned</h3><p>DeepSeek, with its deja vu inducing name and eerily similar whale icon, is a new AI chatbot model wholly owned out of Hangzhou, China. And it's #1 on the Apple App Store, with a bullet. And yes, it's quite up front that it tracks your data. There are several notable features and claims about this model: that it was trained at a fraction of the typical cost and time on a fraction of the typical hardware. That it performs about as well as the OpenAI model released last month, maybe not as well on some things like straight math, but perhaps better at general writing, maybe with a bit more \"personality\". It shows its work - how it arrived at the answer it's providing to the chat prompt. And, for the kicker from the totalitarian state, it's open source up on Hugging Face.[7] </p><p>Shots fired. NVIDIA shares tumbled for a loss of $600B in one day in response, the largest single-day loss in history. 
But what I'd like to know is, if this is the open source model, what's the real one like? Recent Federal governments have tried to enforce policies limiting technology exports to China, but in truth China has long had its own development programs - it has not participated in standardized HPC metric sharing and supercomputer ranking since 2017.</p><h3 style=\"text-align: left;\">Red Flag on the Track</h3><p>Along with NVIDIA, power generation stocks were also down hard - GE Vernova down almost 20%. This does not mean the recent trend of tech companies buying nuclear power plants won't continue, or that we won't continue to hear of existing power stations giving data centers a direct hard wire bypassing the municipal grid. But clearly this open ended hunger for power - watts and GPU cycles - is not sustainable. And DeepSeek lays that bare.</p><p>But make no mistake - this race is not over, it's just warming up. OpenAI with their new Operators product[8] currently defines AGI as a gaggle of collaborating AI agents, each with its own unique set of capabilities and goals. NVIDIA CEO Jensen Huang does his part in driving the GPU-dependent AI hype cycle by saying IT departments will become the new HR departments - for AI employees.[9] Goldman Sachs is telling clients to expect AI employees this year.[10] Cost avoidance will be a major driver.[11] And why not, when the same major technology companies report seemingly amazing results using AI for software development? Google saves 50% on code migration time with AI! It gives *me* FOMO! [12]</p><p>The CEO of Anthropic predicts that by 2027 AI will be generally better than humans at almost everything.[13] Well, at some things maybe better than others. Turns out, what's the number one occupation we expect to be replaced by AI? 
Why, AI engineers and data scientists.</p><table align=\"center\" cellpadding=\"0\" cellspacing=\"0\" class=\"tr-caption-container\" style=\"margin-left: auto; margin-right: auto;\"><tbody><tr><td style=\"text-align: center;\"></td></tr><tr><td class=\"tr-caption\" style=\"text-align: center;\">Job skills impacted by generative AI, ranked. <br />Maybe I should have been a plumber.[14]</td></tr></tbody></table><p>This is vast uncharted territory for companies, especially for those of size, wedded to their legacy political structures and being either un-nimble or worse, fragile. People are not machines. The mistakes AI agents make are not of the same kind made by humans - current generative AI models are intentionally designed to make stuff up, to not just say \"I don't know\".[15] As a manager, you'll get no benefit of human insights into the truth such as from body language, though you may get to understand the \"personality\" and general performance characteristics of your AI employees over time. That is, until the managers are replaced by AI. But until then, what does your management interface look like? Is it perhaps similar to the IDE for a senior software engineer who manages a team of AI coders? In the AI-laced future, there will still be a place for HCI/UX designers.</p><h3 style=\"text-align: left;\">Chef of the Future</h3><p>Finally, the November 2024 report from the National Academies on the \"future of work\" says \"it's impossible to predict exactly the nature of the coming changes in AI and all their effects on the economy and society\".[16] This includes how it changes the nature of various jobs, or outright eliminates them. 
Continuing education will be key to a resilient workforce, and, it turns out, AI might even play a role in that.</p><p>As shown in Washington last week, there are new business trends being driven by a rejuvenated alliance between big technology and big government, and it is therefore time for astute technology and business leaders to pay attention to both.[17] For example, tomorrow (January 30, 2025) OpenAI is holding a previously scheduled closed-door meeting in Washington regarding its own current agentic technology innovations, and what they potentially imply for the US people and its government. I imagine the topic of DeepSeek will now disrupt the meeting agenda, somewhat. </p><p>At minimum, it's a topic for a future blog. Regards. - andy</p><p><br /></p><h3 style=\"text-align: left;\">References</h3><div><div><span style=\"font-size: x-small;\">[0] Photo by AJ Colores on Unsplash, https://unsplash.com/@ajcolores</span></div><div><span style=\"font-size: x-small;\">      </span></div><div><span style=\"font-size: x-small;\">[1] Bill Gates interview by the Wall Street Journal, January 2025, https://www.youtube.com/watch?v=4LL-ynK_exM</span></div><div><span style=\"font-size: x-small;\"><br /></span></div><div><span style=\"font-size: x-small;\">[2] US vs. 
Microsoft, https://en.wikipedia.org/wiki/United_States_v._Microsoft_Corp</span></div><div><span style=\"font-size: x-small;\"><br /></span></div><div><span style=\"font-size: x-small;\">[3] https://apnews.com/article/trump-inauguration-tech-billionaires-zuckerberg-musk-wealth-0896bfc3f50d941d62cebc3074267ecd</span></div><div><span style=\"font-size: x-small;\"><br /></span></div><div><span style=\"font-size: x-small;\">[4] https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)</span></div><div><span style=\"font-size: x-small;\"><br /></span></div><div><span style=\"font-size: x-small;\">[5] https://icpc.global/worldfinals/results </span></div><div><span style=\"font-size: x-small;\"><br /></span></div><div><span style=\"font-size: x-small;\">[6] https://www.reddit.com/r/dataisbeautiful/comments/17a63yo/oc_2023_developer_compensation_by_country taken from the 2023 Stack Overflow developer survey.</span></div><div><span style=\"font-size: x-small;\"><br /></span></div><div><span style=\"font-size: x-small;\">[7] DeepSeek at Hugging Face: https://huggingface.co/organizations/deepseek-ai/activity/all</span></div><div><span style=\"font-size: x-small;\"><br /></span></div><div><span style=\"font-size: x-small;\">[8] OpenAI Operators: https://www.nytimes.com/2025/01/23/technology/openai-operator-launch.html</span></div><div><span style=\"font-size: x-small;\"> </span></div><div><span style=\"font-size: x-small;\">[9] NVIDIA CEO Jensen Huang on IT as the new HR: https://www.aol.com/finance/nvidia-jensen-huang-says-become-133641793.html</span></div><div><span style=\"font-size: x-small;\"><br /></span></div><div><span style=\"font-size: x-small;\">[10] Goldman Sachs on the rise of AI employees: https://it.slashdot.org/story/25/01/21/2213230/managing-ai-agents-as-employees-is-the-challenge-of-2025-says-goldman-sachs-cio</span></div><div><span style=\"font-size: x-small;\"><br /></span></div><div><span style=\"font-size: x-small;\">[11] 
https://www.msn.com/en-us/money/markets/why-cost-avoidance-became-an-ai-buzzword-for-holding-down-headcount/ar-BB1rmJSx</span></div><div><span style=\"font-size: x-small;\"><br /></span></div><div><span style=\"font-size: x-small;\">[12] https://developers.slashdot.org/story/25/01/17/2156235/google-reports-halving-code-migration-time-with-ai-help</span></div><div><span style=\"font-size: x-small;\"><br /></span></div><div><span style=\"font-size: x-small;\">[13] https://arstechnica.com/ai/2025/01/anthropic-chief-says-ai-could-surpass-almost-all-humans-at-almost-everything-shortly-after-2027/</span></div><div><span style=\"font-size: x-small;\"><br /></span></div><div><span style=\"font-size: x-small;\">[14] https://reports.weforum.org/docs/WEF_Future_of_Jobs_Report_2025.pdf</span></div><div><span style=\"font-size: x-small;\"><br /></span></div><div><span style=\"font-size: x-small;\">[15] https://slashdot.org/story/25/01/23/1645242/ai-mistakes-are-very-different-from-human-mistakes</span></div><div><span style=\"font-size: x-small;\"><br /></span></div><div><span style=\"font-size: x-small;\">[16] https://nap.nationalacademies.org/resource/27644/interactive/</span></div><div><span style=\"font-size: x-small;\"><br /></span></div><div><span style=\"font-size: x-small;\">[17] https://hbr.org/2024/11/navigating-the-new-geopolitics-of-tech</span></div><div><span style=\"font-size: x-small;\"><br /></span></div><div><br /></div></div>",
            "url": "https://hpc.social/personal-blog/2025/surfing-the-singularity-the-world-is-not-flat/",
            
            
            
            
            
            "date_published": "2025-01-29T18:11:00-07:00",
            "date_modified": "2025-01-29T18:11:00-07:00",
            
                "author": "Surfing the Singularity"
            
        },
    
        {
            "id": "https://hpc.social/personal-blog/2025/fine-tuning-ai-models-with-instructlab-under-ibm-lsf/",
            "title": "Fine tuning AI models with InstructLab under IBM LSF",
            "summary": null,
            "content_text": "Overview\n\nAll the best for 2025! This blog looks back on a demo which I created for SC24 last November to demonstrate InstructLab workflows running on an IBM LSF cluster. Let’s begin with a bit of background. I’d like to thank Michael Spriggs, STSM, IBM LSF for his contributions to this blog.\n\nWhen I think of tuning, what immediately comes to my mind are visions of an expert mechanic trying to extract the most from an engine. This blog is focused on an entirely different type of tuning, AI model tuning. Like tuning an engine, AI model tuning can be used to ensure a better fit for a given AI model for your business.\n\nReleased by IBM and Red Hat in May 2024, InstructLab is an open-source project which provides the ability to fine-tune LLMs by adding skills and knowledge, without having to retrain the model from scratch. InstructLab can run on resource-constrained systems such as laptops, but also supports GPUs. Much has been written about InstructLab and this blog is not intended to provide an in-depth look at InstructLab. Rather, the objective here is to demonstrate how InstructLab workloads can be distributed and managed in a high-performance computing cluster with GPUs using the IBM LSF workload scheduler. Recently, IBM published a paper describing the infrastructure used to train the Granite family of AI foundation models. The paper describes the Vela and Blue Vela environments in detail. In particular, the Blue Vela environment is built on a software stack using Red Hat Enterprise Linux, IBM LSF and Storage Scale. Learn more in the detailed paper here.\n\nThe demo workflow consists of two LSF jobs. The first job generates synthetic data, which is used to teach the LLM new skills or knowledge. 
The second job, which depends upon the successful completion of the first, is the training job, where the new skills or knowledge are incorporated into an existing base model. A simple LSF job dependency is used to ensure the training job only runs after the successful completion of the synthetic data generation step.\n\nThe environment used is equipped with Nvidia GPUs. InstructLab jobs will be run with the options for GPU support, and the jobs will be submitted to LSF with the appropriate GPU scheduling directives. Furthermore, it is assumed that the user's $HOME directory is available on all hosts in the cluster. Note that I require neither root access, nor a user account that is an LSF administrator, to install and use InstructLab on the LSF cluster.\n\nConfiguration\n\nThe HPC cluster is configured as follows:\nRed Hat Enterprise Linux v8.8\nIBM LSF v10.0.1.15\nInstructLab v0.19.4\nMiniforge v3 (24.9.0-0)\nNVIDIA CUDA v12.6\nCompute nodes are equipped with 8 x Nvidia H100 GPUs\n\nInstall InstructLab\n\nLog in to a compute node in the LSF cluster equipped with GPUs. If ssh access to compute nodes is disabled, then submit an interactive LSF batch job. This job requests 8 GPUs on a single system and will set them to exclusive execution mode.\n\n$ bsub -Is -R \"span[hosts=1]\" -gpu \"num=8:j_exclusive=yes\" bash\n\nInstall and set up a Conda environment. This will enable you to install a self-contained Conda environment for your user account with the Python version needed for InstructLab. Miniforge is installed in the default location and the option to update the user's shell profile to start the Conda environment is selected. We assume here a shared $HOME directory.\n\n$ cd $HOME\n$ curl -L -O \"https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh\"\n$ bash Miniforge3-$(uname)-$(uname -m).sh\n\nBefore proceeding, you must log out and log back in to activate the environment. Next, a Conda environment is created with name my_env. 
Here we’ll specify Python v3.11, which is a requirement for InstructLab.\n\n$ conda create --name my_env -c anaconda python=3.11\n$ conda activate my_env\n\nNext, install InstructLab. Here, version 0.19.4 of InstructLab is specified. This was the version of InstructLab available in the timeframe preceding the SC24 event. Follow the installation steps in the official InstructLab documentation here.\n\n$ pip install instructlab==0.19.4\n\nNext, perform the installation of InstructLab with Nvidia CUDA support. This is required for InstructLab to utilize the GPUs. Without this step, InstructLab will run on the CPUs. Note that CUDA v12.6 is installed on the system and the variables set below reflect this.\n\n$ export CMAKE_ARGS=\"-DLLAMA_CUBLAS=on -DCUDA_PATH=/usr/local/cuda-12.6 -DCUDAToolkit_ROOT=/usr/local/cuda-12.6 -DCUDAToolkit_INCLUDE_DIR=/usr/local/cuda-12.6/include -DCUDAToolkit_LIBRARY_DIR=/usr/local/cuda-12.6/lib64\"\n$ export PATH=/usr/local/cuda-12.6/bin:$PATH\n$ pip cache remove llama_cpp_python\n$ CMAKE_ARGS=\"-DLLAMA_CUDA=on -DLLAMA_NATIVE=off\" pip install 'instructlab[cuda]'\n$ pip install vllm@git+https://github.com/opendatahub-io/vllm@v0.6.2\n\nConfigure InstructLab\n\nWith the installation of InstructLab complete, the next step is to run the initialization. This will set up paths to models and the taxonomy repo, as well as the GPU configuration.\n\n$ ilab config init\n\nBy default InstructLab stores models, training checkpoints and other files within ~/.cache and ~/.local/share/instructlab. If you have limited storage capacity available in $HOME, then you may opt to disable training checkpoint files. This can be done by setting the following option in ~/.config/instructlab/config.yaml.\n\ntrain:\n  checkpoint_at_epoch: false\n\nNext, we download the required models. The ilab model list command can be used to list the models which are available. Note that a HuggingFace token is required to download certain models. 
Please set HF_TOKEN in the environment with the appropriate token.\n\n$ export HF_TOKEN=&lt;HuggingFace token&gt;\n$ ilab model download\n$ ilab model download --repository=instructlab/granite-7b-lab\n$ ilab model list\n+--------------------------------------+---------------------+---------+\n| Model Name                           | Last Modified       | Size    |\n+--------------------------------------+---------------------+---------+\n| instructlab/granite-7b-lab           | 2024-12-27 20:37:29 | 12.6 GB |\n| mistral-7b-instruct-v0.2.Q4_K_M.gguf | 2024-12-27 16:55:46 | 4.1 GB  |\n| merlinite-7b-lab-Q4_K_M.gguf         | 2024-12-27 16:48:39 | 4.1 GB  |\n+--------------------------------------+---------------------+---------+\n\nGenerate synthetic data &amp; AI model training\n\nNext is the synthetic data generation step, which will be executed on GPUs. This step is a prerequisite to teaching the LLM new skills/knowledge via training.\n\nHere we use example knowledge from the InstructLab GitHub about Taylor Swift fans, who are known as “Swifties”. This is timely because Taylor Swift recently wrapped up 6 concerts in Toronto, Canada, where I happen to be based. Copy attribution.txt and qna.yaml from the following location.\n\nBy default, the InstructLab taxonomy is found in ~/.local/share/instructlab/taxonomy. Here we create the directories fandom/swifties under ~/.local/share/instructlab/taxonomy/knowledge/arts and copy the two files obtained above into this location.\n\n$ mkdir -p ~/.local/share/instructlab/taxonomy/knowledge/arts/fandom/swifties\n$ cp &lt;path_to&gt;/attribution.txt ~/.local/share/instructlab/taxonomy/knowledge/arts/fandom/swifties\n$ cp &lt;path_to&gt;/qna.yaml ~/.local/share/instructlab/taxonomy/knowledge/arts/fandom/swifties\n\nWith the Swifties taxonomy in place, check for any syntax errors with the command ilab taxonomy diff. 
It should report that the taxonomy is valid if there are no syntax errors.\n\n$ ilab taxonomy diff\nknowledge/arts/fandom/swifties/qna.yaml\nTaxonomy in /u/gsamu/.local/share/instructlab/taxonomy is valid :)\n\nWith the taxonomy in place and having confirmed that the syntax is valid, it’s now time to run the synthetic data generation job through LSF. Here we will request 8 GPUs on a single server in exclusive execution mode. For the InstructLab ilab command, specify the --gpus 8 and --pipeline full options. Standard output is written to the $HOME/job-output directory with the filename &lt;LSF_JOBID&gt;.out. The $HOME/job-output directory must already exist.\n\n$ mkdir -p $HOME/job-output\n$ bsub -o $HOME/job-output/%J.out -R \"span[hosts=1]\" -gpu \"num=8:j_exclusive=yes\" ilab data generate --pipeline full --gpus 8\nJob &lt;1131&gt; is submitted to default queue &lt;normal&gt;.\n\nDuring job execution, the LSF bpeek command can be used to monitor the job standard output.\n\n$ bpeek -f 1131\n&lt;&lt; output from stdout &gt;&gt;\nINFO 2025-01-02 09:51:29,503 numexpr.utils:146: Note: detected 96 virtual cores but NumExpr set to maximum of 64, check \"NUMEXPR_MAX_THREADS\" environment variable.\nINFO 2025-01-02 09:51:29,504 numexpr.utils:149: Note: NumExpr detected 96 cores but \"NUMEXPR_MAX_THREADS\" not set, so enforcing safe limit of 16.\nINFO 2025-01-02 09:51:29,504 numexpr.utils:162: NumExpr defaulting to 16 threads.\nINFO 2025-01-02 09:51:30,038 datasets:59: PyTorch version 2.3.1 available.\nINFO 2025-01-02 09:51:31,226 instructlab.model.backends.llama_cpp:100: Trying to connect to model server at http://127.0.0.1:8000/v1\nWARNING 2025-01-02 09:51:56,356 instructlab.data.generate:270: Disabling SDG batching - unsupported with llama.cpp serving\nGenerating synthetic data using 'full' pipeline, '/u/gsamu/.cache/instructlab/models/mistral-7b-instruct-v0.2.Q4_K_M.gguf' model, '/u/gsamu/.local/share/instructlab/taxonomy' taxonomy, against http://127.0.0.1:55779/v1 server\nINFO 2025-01-02 09:51:56,861 
instructlab.sdg.generate_data:356: Synthesizing new instructions. If you aren't satisfied with the generated instructions, interrupt training (Ctrl-C) and try adjusting your YAML files. Adding more examples may help.INFO 2025-01-02 09:51:56,872 instructlab.sdg.pipeline:153: Running pipeline single-threadedINFO 2025-01-02 09:51:56,872 instructlab.sdg.pipeline:197: Running block: duplicate_document_colINFO 2025-01-02 09:51:56,872 instructlab.sdg.pipeline:198: Dataset({    features: ['icl_document', 'document', 'document_outline', 'domain', 'icl_query_1', 'icl_query_2', 'icl_query_3', 'icl_response_1', 'icl_response_2', 'icl_response_3'],    num_rows: 35})INFO 2025-01-02 09:51:58,286 instructlab.sdg.llmblock:51: LLM server supports batched inputs: FalseINFO 2025-01-02 09:51:58,286 instructlab.sdg.pipeline:197: Running block: gen_spellcheckINFO 2025-01-02 09:51:58,286 instructlab.sdg.pipeline:198: Dataset({    features: ['icl_document', 'document', 'document_outline', 'domain', 'icl_query_1', 'icl_query_2', 'icl_query_3', 'icl_response_1', 'icl_response_2', 'icl_response_3', 'base_document'],    num_rows: 35})/u/gsamu/miniforge3/envs/my_env/lib/python3.11/site-packages/llama_cpp/llama.py:1054: RuntimeWarning: Detected duplicate leading \"&lt;s&gt;\" in prompt, this will likely reduce response quality, consider removing it...  
warnings.warn(INFO 2025-01-02 09:57:42,264 instructlab.sdg.pipeline:197: Running block: flatten_auxiliary_columnsINFO 2025-01-02 09:57:42,264 instructlab.sdg.pipeline:198: Dataset({    features: ['icl_document', 'document', 'document_outline', 'domain', 'icl_query_1', 'icl_query_2', 'icl_query_3', 'icl_response_1', 'icl_response_2', 'icl_response_3', 'base_document', 'spellcheck'],    num_rows: 35})INFO 2025-01-02 09:57:42,279 instructlab.sdg.pipeline:197: Running block: rename_to_document_columnINFO 2025-01-02 09:57:42,279 instructlab.sdg.pipeline:198: Dataset({    features: ['icl_document', 'document', 'document_outline', 'domain', 'icl_query_1', 'icl_query_2', 'icl_query_3', 'icl_response_1', 'icl_response_2', 'icl_response_3', 'dataset_type', 'corrected_document'],    num_rows: 70})INFO 2025-01-02 09:57:42,282 instructlab.sdg.pipeline:197: Running block: gen_knowledgeINFO 2025-01-02 09:57:42,282 instructlab.sdg.pipeline:198: Dataset({    features: ['icl_document', 'raw_document', 'document_outline', 'domain', 'icl_query_1', 'icl_query_2', 'icl_query_3', 'icl_response_1', 'icl_response_2', 'icl_response_3', 'dataset_type', 'document'],    num_rows: 70})……During the runtime of the job, it’s possible to view GPU related metricsusing the LSF lsload and bhosts commands. First, we need to identify the hostwhere the job has been dispatched to using the LSF bjobs command. In this casethe job was dispatched to host p1-r01-n4. 
Note that detailed GPU accounting metrics are available once the job runs to completion.\n\n$ bjobs -w\nJOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME\n1131    gsamu   RUN   normal     rmf-login-1 p1-r01-n4   ilab data generate --pipeline full --gpus 8 Jan  2 14:51\n\n$ lsload -w -gpu p1-r01-n4\nHOST_NAME                 status ngpus gpu_shared_avg_mut gpu_shared_avg_ut ngpus_physical\np1-r01-n4                     ok     8                 2%                7%              8\n\n$ bhosts -w -gpu p1-r01-n4\nHOST_NAME            GPU_ID                MODEL     MUSED      MRSV  NJOBS    RUN   SUSP    RSV\np1-r01-n4                 0   NVIDIAH10080GBHBM3        2G        0G      1      1      0      0\n                          1   NVIDIAH10080GBHBM3        2G        0G      1      1      0      0\n                          2   NVIDIAH10080GBHBM3        2G        0G      1      1      0      0\n                          3   NVIDIAH10080GBHBM3        2G        0G      1      1      0      0\n                          4   NVIDIAH10080GBHBM3        2G        0G      1      1      0      0\n                          5   NVIDIAH10080GBHBM3        2G        0G      1      1      0      0\n                          6   NVIDIAH10080GBHBM3        2G        0G      1      1      0      0\n                          7   NVIDIAH10080GBHBM3        2G        0G      1      1      0      0\n\nAfter job completion, it’s possible to view details about the job, including GPU utilization, which LSF collects by leveraging NVIDIA DCGM. 
These metrics areavailable upon job completion using both the LSF bhist and bjobs commands.$ bhist -l -gpu 1131Job &lt;1131&gt;, User &lt;gsamu&gt;, Project &lt;default&gt;, Command &lt;ilab data generate --pipe                          line full --gpus 8&gt;Thu Jan  2 14:51:23 2025: Submitted from host &lt;rmf-login-1&gt;, to Queue &lt;normal&gt;,                           CWD &lt;$HOME&gt;, Output File &lt;/u/gsamu/job-output/%J.out                          &gt;, Requested Resources &lt;span[hosts=1]&gt;, Requested GPU                           &lt;num=8:j_exclusive=yes&gt;;Thu Jan  2 14:51:24 2025: Dispatched 1 Task(s) on Host(s) &lt;p1-r01-n4&gt;, Allocate                          d 1 Slot(s) on Host(s) &lt;p1-r01-n4&gt;, Effective RES_REQ                           &lt;select[((ngpus&gt;0)) &amp;&amp; (type == local)] order[r15s:p                          g] rusage[ngpus_physical=8.00] span[hosts=1] &gt;;Thu Jan  2 14:51:25 2025: Starting (Pid 3095851);Thu Jan  2 14:51:25 2025: External Message \"p1-r01-n4:gpus=0,1,2,3,4,5,6,7;EFFE                          CTIVE GPU REQ: num=8:mode=shared:mps=no:j_exclusive=y                          es:gvendor=nvidia;\" was posted from \"gsamu\" to messag                          e box 0;Thu Jan  2 14:51:26 2025: Running with execution home &lt;/u/gsamu&gt;, Execution CWD                           &lt;/u/gsamu&gt;, Execution Pid &lt;3095851&gt;;Thu Jan  2 16:08:05 2025: Done successfully. 
The CPU time used is 4624.0 second                          s;                          HOST: p1-r01-n4; CPU_TIME: 4624 seconds                                                        GPU ID: 0                                  Total Execution Time: 4597 seconds                                  Energy Consumed: 579704 Joules                                  SM Utilization (%): Avg 9, Max 15, Min 0                                  Memory Utilization (%): Avg 2, Max 100, Min 0                                  Max GPU Memory Used: 1956642816 bytes                              GPU ID: 1                                  Total Execution Time: 4597 seconds                                  Energy Consumed: 503956 Joules                                  SM Utilization (%): Avg 7, Max 11, Min 0                                  Memory Utilization (%): Avg 2, Max 5, Min 0                                  Max GPU Memory Used: 1767899136 bytes                              GPU ID: 2                                  Total Execution Time: 4597 seconds                                  Energy Consumed: 501754 Joules                                  SM Utilization (%): Avg 7, Max 11, Min 0                                  Memory Utilization (%): Avg 2, Max 5, Min 0                                  Max GPU Memory Used: 1784676352 bytes                              GPU ID: 3                                  Total Execution Time: 4597 seconds                                  Energy Consumed: 525195 Joules                                  SM Utilization (%): Avg 7, Max 11, Min 0                                  Memory Utilization (%): Avg 2, Max 54, Min 0                                  Max GPU Memory Used: 1767899136 bytes                              GPU ID: 4                                  Total Execution Time: 4597 seconds                                  Energy Consumed: 525331 Joules                                  SM Utilization (%): Avg 7, Max 12, Min 0                           
       Memory Utilization (%): Avg 2, Max 5, Min 0                                  Max GPU Memory Used: 1767899136 bytes                              GPU ID: 5                                  Total Execution Time: 4597 seconds                                  Energy Consumed: 502416 Joules                                  SM Utilization (%): Avg 7, Max 11, Min 0                                  Memory Utilization (%): Avg 2, Max 5, Min 0                                  Max GPU Memory Used: 1784676352 bytes                              GPU ID: 6                                  Total Execution Time: 4597 seconds                                  Energy Consumed: 508720 Joules                                  SM Utilization (%): Avg 7, Max 12, Min 0                                  Memory Utilization (%): Avg 2, Max 5, Min 0                                  Max GPU Memory Used: 1784676352 bytes                              GPU ID: 7                                  Total Execution Time: 4597 seconds                                  Energy Consumed: 491041 Joules                                  SM Utilization (%): Avg 6, Max 12, Min 0                                  Memory Utilization (%): Avg 2, Max 4, Min 0                                  Max GPU Memory Used: 1933574144 bytesGPU Energy Consumed: 4138117.000000 JoulesThu Jan  2 16:08:05 2025: Post job process done successfully;GPU_ALLOCATION: HOST             TASK GPU_ID  GI_PLACEMENT/SIZE    CI_PLACEMENT/SIZE    MODEL        MTOTAL  FACTOR MRSV    SOCKET NVLINK/XGMI                       p1-r01-n4        0    0       -                    -                    NVIDIAH10080 80G     9.0    0G      0      -                                                 0    1       -                    -                    NVIDIAH10080 80G     9.0    0G      0      -                                                 0    2       -                    -                    NVIDIAH10080 80G     9.0    0G      0      -                    
\n                 0    3       -                    -                    NVIDIAH10080 80G     9.0    0G      0      -\n                 0    4       -                    -                    NVIDIAH10080 80G     9.0    0G      1      -\n                 0    5       -                    -                    NVIDIAH10080 80G     9.0    0G      1      -\n                 0    6       -                    -                    NVIDIAH10080 80G     9.0    0G      1      -\n                 0    7       -                    -                    NVIDIAH10080 80G     9.0    0G      1      -\n\nMEMORY USAGE:\nMAX MEM: 2 Gbytes;  AVG MEM: 1 Gbytes; MEM Efficiency: 0.00%\n\nCPU USAGE:\nCPU PEAK: 1.69 ;  CPU PEAK DURATION: 52 second(s)\nCPU AVERAGE EFFICIENCY: 100.69% ;  CPU PEAK EFFICIENCY: 169.23%\n\nSummary of time in seconds spent in various states by  Thu Jan  2 16:08:05 2025\n  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL\n  1        0        4601     0        0        0        4602\n\nWhen the synthetic data generation job completes, its output can be viewed at ~/job-output/.out. The synthetic data sets will comprise files in the directory ~/.local/share/instructlab/datasets. These files will be named skills_train_msgs_*.jsonl and knowledge_train_msgs_*.jsonl.\n\nWith the synthetic data generation step complete, it’s now time to run the training. 
We first set 2 environment variables to point to the following files: ~/.local/share/instructlab/datasets/knowledge_train_msgs_2025-01-02T09_51_56.jsonl and ~/.local/share/instructlab/datasets/skills_train_msgs_2025-01-02T09_51_56.jsonl. Afterward, we submit the training job to LSF requesting 8 GPUs and with ilab options --pipeline accelerated, --gpus 8, --device cuda and --data-path pointing to the two data files above, which were produced in the synthetic data generation step.\n\n$ export SKILLS_PATH=/u/gsamu/.local/share/instructlab/datasets/skills_train_msgs_2025-01-02T09_51_56.jsonl\n$ export KNOWLEDGE_PATH=/u/gsamu/.local/share/instructlab/datasets/knowledge_train_msgs_2025-01-02T09_51_56.jsonl\n$ bsub -o $HOME/job-output/%J.out -R \"span[hosts=1]\" -gpu \"num=8:j_exclusive=yes\" ilab model train --pipeline accelerated --data-path $SKILLS_PATH --data-path $KNOWLEDGE_PATH --device cuda --gpus 8\nJob &lt;1135&gt; is submitted to default queue &lt;normal&gt;.\n\nDuring job execution, the LSF bpeek command can be used to monitor the job standard output.\n\n$ bpeek -f 1135\n&lt;&lt; output from stdout &gt;&gt;\nLoRA is disabled (rank=0), ignoring all additional LoRA args\n[2025-01-02 12:52:04,359] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)\nINFO 2025-01-02 12:52:09,061 numexpr.utils:146: Note: detected 96 virtual cores but NumExpr set to maximum of 64, check \"NUMEXPR_MAX_THREADS\" environment variable.\nINFO 2025-01-02 12:52:09,061 numexpr.utils:149: Note: NumExpr detected 96 cores but \"NUMEXPR_MAX_THREADS\" not set, so enforcing safe limit of 16.\nINFO 2025-01-02 12:52:09,061 numexpr.utils:162: NumExpr defaulting to 16 threads.\nINFO 2025-01-02 12:52:09,304 datasets:59: PyTorch version 2.3.1 available.\nYou are using the default legacy behaviour of the &lt;class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'&gt;. 
This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.INFO 2025-01-02 12:52:09,653 root:617: Special tokens: eos: [32000], pad: [32001], bos: [32005], system: [32004], user: [32002], assistant: [32003]INFO 2025-01-02 12:52:09,923 root:617: number of dropped samples: 0 -- out of 641 data arguments are:{\"data_path\":\"/u/gsamu/.local/share/instructlab/datasets/knowledge_train_msgs_2025-01-02T09_51_56.jsonl\",\"data_output_path\":\"/u/gsamu/.local/share/instructlab/internal\",\"max_seq_len\":4096,\"model_path\":\"/u/gsamu/.cache/instructlab/models/instructlab/granite-7b-lab\",\"chat_tmpl_path\":\"/u/gsamu/miniforge3/envs/my_env/lib/python3.11/site-packages/instructlab/training/chat_templates/ibm_generic_tmpl.py\",\"num_cpu_procs\":16}tokenizing the dataset with /u/gsamu/.cache/instructlab/models/instructlab/granite-7b-lab tokenizer...ten largest length percentiles:quantile 90th: 1459.0quantile 91th: 1466.0quantile 92th: 1469.6000000000001quantile 93th: 1478.2quantile 94th: 1483.0quantile 95th: 1488.0quantile 96th: 1497.1999999999998quantile 97th: 1516.5999999999997quantile 98th: 1540.6000000000001quantile 99th: 1656.0000000000016quantile 100th: 2578.0at 4096 max sequence length, the number of samples to be dropped is 0(0.00% of total)quantile 0th: 368.0quantile 1th: 393.0quantile 2th: 411.2quantile 3th: 421.2quantile 4th: 427.2quantile 5th: 442.0quantile 6th: 604.4quantile 7th: 631.8quantile 8th: 653.8000000000001quantile 9th: 679.8quantile 10th: 742.0at 20 min sequence length, the number of samples to be dropped is 0checking the validity of the samples...Categorizing training data type...unmasking 
the appropriate message content... Samples Previews...\n……\n\nDuring the runtime of the training job, we can observe some GPU utilization information using the LSF lsload and bhosts commands. First we need to identify the server on which the training job is running. This is done using the bjobs command and checking for the execution host of the job.\n\n$ bjobs -w\nJOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME\n1135    gsamu   RUN   normal     rmf-login-1 p1-r01-n1   ilab model train --pipeline accelerated --data-path /u/gsamu/.local/share/instructlab/datasets/skills_train_msgs_2025-01-02T09_51_56.jsonl --data-path /u/gsamu/.local/share/instructlab/datasets/knowledge_train_msgs_2025-01-02T09_51_56.jsonl --device cuda --gpus 8 Jan  2 17:51\n\n$ lsload -w -gpu p1-r01-n1\nHOST_NAME                 status ngpus gpu_shared_avg_mut gpu_shared_avg_ut ngpus_physical\np1-r01-n1                     ok     8                 0%               22%              8\n\n$ bhosts -w -gpu p1-r01-n1\nHOST_NAME            GPU_ID                MODEL     MUSED      MRSV  NJOBS    RUN   SUSP    RSV\np1-r01-n1                 0   NVIDIAH10080GBHBM3       10G        0G      1      1      0      0\n                          1   NVIDIAH10080GBHBM3       10G        0G      1      1      0      0\n                          2   NVIDIAH10080GBHBM3       10G        0G      1      1      0      0\n                          3   NVIDIAH10080GBHBM3       10G        0G      1      1      0      0\n                          4   NVIDIAH10080GBHBM3       10G        0G      1      1      0      0\n                          5   NVIDIAH10080GBHBM3       10G        0G      1      1      0      0\n                          6   NVIDIAH10080GBHBM3       10G        0G      1      1      0      0\n                          7   NVIDIAH10080GBHBM3       10G        0G      1      1      0      0\n\nOnce the job is complete, detailed GPU accounting can again be viewed using the LSF bhist command as follows.\n\n$ bhist -l -gpu 1135\nJob 
&lt;1135&gt;, User &lt;gsamu&gt;, Project &lt;default&gt;, Command &lt;ilab model train --pipeli                          ne accelerated --data-path /u/gsamu/.local/share/inst                          ructlab/datasets/skills_train_msgs_2025-01-02T09_51_5                          6.jsonl --data-path /u/gsamu/.local/share/instructlab                          /datasets/knowledge_train_msgs_2025-01-02T09_51_56.js                          onl --device cuda --gpus 8&gt;Thu Jan  2 17:51:48 2025: Submitted from host &lt;rmf-login-1&gt;, to Queue &lt;normal&gt;,                           CWD &lt;$HOME/.local/share/instructlab/checkpoints&gt;, Ou                          tput File &lt;/u/gsamu/job-output/%J.out&gt;, Requested Res                          ources &lt;span[hosts=1]&gt;, Requested GPU &lt;num=8:j_exclus                          ive=yes&gt;;Thu Jan  2 17:51:48 2025: Dispatched 1 Task(s) on Host(s) &lt;p1-r01-n1&gt;, Allocate                          d 1 Slot(s) on Host(s) &lt;p1-r01-n1&gt;, Effective RES_REQ                           &lt;select[((ngpus&gt;0)) &amp;&amp; (type == local)] order[r15s:p                          g] rusage[ngpus_physical=8.00] span[hosts=1] &gt;;Thu Jan  2 17:51:49 2025: Starting (Pid 3462241);Thu Jan  2 17:51:49 2025: Running with execution home &lt;/u/gsamu&gt;, Execution CWD                           &lt;/u/gsamu/.local/share/instructlab/checkpoints&gt;, Exe                          cution Pid &lt;3462241&gt;;Thu Jan  2 17:51:49 2025: External Message \"p1-r01-n1:gpus=0,1,2,3,4,5,6,7;EFFE                          CTIVE GPU REQ: num=8:mode=shared:mps=no:j_exclusive=y                          es:gvendor=nvidia;\" was posted from \"gsamu\" to messag                          e box 0;Thu Jan  2 17:57:56 2025: Done successfully. 
The CPU time used is 3024.0 second                          s;                          HOST: p1-r01-n1; CPU_TIME: 3024 seconds                                                        GPU ID: 0                                  Total Execution Time: 365 seconds                                  Energy Consumed: 98890 Joules                                  SM Utilization (%): Avg 20, Max 100, Min 0                                  Memory Utilization (%): Avg 9, Max 62, Min 0                                  Max GPU Memory Used: 53022294016 bytes                              GPU ID: 1                                  Total Execution Time: 365 seconds                                  Energy Consumed: 97697 Joules                                  SM Utilization (%): Avg 53, Max 100, Min 0                                  Memory Utilization (%): Avg 9, Max 58, Min 0                                  Max GPU Memory Used: 53087305728 bytes                              GPU ID: 2                                  Total Execution Time: 365 seconds                                  Energy Consumed: 94820 Joules                                  SM Utilization (%): Avg 53, Max 100, Min 0                                  Memory Utilization (%): Avg 9, Max 62, Min 0                                  Max GPU Memory Used: 53221523456 bytes                              GPU ID: 3                                  Total Execution Time: 365 seconds                                  Energy Consumed: 98014 Joules                                  SM Utilization (%): Avg 53, Max 100, Min 0                                  Memory Utilization (%): Avg 9, Max 59, Min 0                                  Max GPU Memory Used: 53041168384 bytes                              GPU ID: 4                                  Total Execution Time: 365 seconds                                  Energy Consumed: 99246 Joules                                  SM Utilization (%): Avg 53, Max 100, Min 0                      
            Memory Utilization (%): Avg 9, Max 60, Min 0                                  Max GPU Memory Used: 53045362688 bytes                              GPU ID: 5                                  Total Execution Time: 365 seconds                                  Energy Consumed: 94952 Joules                                  SM Utilization (%): Avg 53, Max 100, Min 0                                  Memory Utilization (%): Avg 9, Max 65, Min 0                                  Max GPU Memory Used: 53047459840 bytes                              GPU ID: 6                                  Total Execution Time: 365 seconds                                  Energy Consumed: 98227 Joules                                  SM Utilization (%): Avg 53, Max 100, Min 0                                  Memory Utilization (%): Avg 9, Max 63, Min 0                                  Max GPU Memory Used: 53127151616 bytes                              GPU ID: 7                                  Total Execution Time: 365 seconds                                  Energy Consumed: 94582 Joules                                  SM Utilization (%): Avg 52, Max 100, Min 0                                  Memory Utilization (%): Avg 9, Max 65, Min 0                                  Max GPU Memory Used: 53481570304 bytesGPU Energy Consumed: 776428.000000 JoulesThu Jan  2 17:57:56 2025: Post job process done successfully;GPU_ALLOCATION: HOST             TASK GPU_ID  GI_PLACEMENT/SIZE    CI_PLACEMENT/SIZE    MODEL        MTOTAL  FACTOR MRSV    SOCKET NVLINK/XGMI                       p1-r01-n1        0    0       -                    -                    NVIDIAH10080 80G     9.0    0G      0      -                                                 0    1       -                    -                    NVIDIAH10080 80G     9.0    0G      0      -                                                 0    2       -                    -                    NVIDIAH10080 80G     9.0    0G      0      -        
                                         0    3       -                    -                    NVIDIAH10080 80G     9.0    0G      0      -                                                 0    4       -                    -                    NVIDIAH10080 80G     9.0    0G      1      -                                                 0    5       -                    -                    NVIDIAH10080 80G     9.0    0G      1      -                                                 0    6       -                    -                    NVIDIAH10080 80G     9.0    0G      1      -                                                 0    7       -                    -                    NVIDIAH10080 80G     9.0    0G      1      -                               MEMORY USAGE:MAX MEM: 104 Gbytes;  AVG MEM: 16 Gbytes; MEM Efficiency: 0.00%CPU USAGE:CPU PEAK: 17.86 ;  CPU PEAK DURATION: 49 second(s)CPU AVERAGE EFFICIENCY: 856.60% ;  CPU PEAK EFFICIENCY: 1785.71%Summary of time in seconds spent in various states by  Thu Jan  2 17:57:56 2025  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL  0        0        368      0        0        0        368         Finally, with the model successfully trained, let’s chat with the new model to check the result. Here we’ll ask it Swiftie-specific questions. Note that the output from the training is written to ~/.local/share/instructlab/checkpoints/hf_format. We’ll take the model from the latest checkpoint directory that was created. Here again, we launch the model chat job via LSF as an interactive batch job (i.e. 
bsub -Is).$ grep hf_format 1135.outModel saved in /u/gsamu/.local/share/instructlab/checkpoints/hf_format/samples_886Model saved in /u/gsamu/.local/share/instructlab/checkpoints/hf_format/samples_1776Model saved in /u/gsamu/.local/share/instructlab/checkpoints/hf_format/samples_2658Model saved in /u/gsamu/.local/share/instructlab/checkpoints/hf_format/samples_3546Model saved in /u/gsamu/.local/share/instructlab/checkpoints/hf_format/samples_4435$ bsub -Is -R \"span[hosts=1]\" -gpu \"num=8:j_exclusive=yes\" ilab model chat --model /u/gsamu/.local/share/instructlab/checkpoints/hf_format/samples_4435Job &lt;1146&gt; is submitted to default queue &lt;interactive&gt;.&lt;&lt;Waiting for dispatch ...&gt;&gt;&lt;&lt;Starting on p1-r01-n2&gt;&gt;INFO 2025-01-02 15:06:07,600 instructlab.model.backends.vllm:105: Trying to connect to model server at http://127.0.0.1:8000/v1INFO 2025-01-02 15:06:08,876 instructlab.model.backends.vllm:308: vLLM starting up on pid 3744375 at http://127.0.0.1:41531/v1INFO 2025-01-02 15:06:08,876 instructlab.model.backends.vllm:114: Starting a temporary vLLM server at http://127.0.0.1:41531/v1INFO 2025-01-02 15:06:08,876 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:41531/v1, this might take a moment... Attempt: 1/120INFO 2025-01-02 15:06:12,244 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:41531/v1, this might take a moment... Attempt: 2/120INFO 2025-01-02 15:06:15,614 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:41531/v1, this might take a moment... Attempt: 3/120INFO 2025-01-02 15:06:18,801 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:41531/v1, this might take a moment... Attempt: 4/120INFO 2025-01-02 15:06:21,952 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:41531/v1, this might take a moment... 
Attempt: 5/120INFO 2025-01-02 15:06:25,391 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:41531/v1, this might take a moment... Attempt: 6/120INFO 2025-01-02 15:06:28,638 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:41531/v1, this might take a moment... Attempt: 7/120INFO 2025-01-02 15:06:32,103 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:41531/v1, this might take a moment... Attempt: 8/120INFO 2025-01-02 15:06:35,296 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:41531/v1, this might take a moment... Attempt: 9/120INFO 2025-01-02 15:06:38,616 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:41531/v1, this might take a moment... Attempt: 10/120INFO 2025-01-02 15:06:42,015 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:41531/v1, this might take a moment... Attempt: 11/120INFO 2025-01-02 15:06:45,435 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:41531/v1, this might take a moment... Attempt: 12/120INFO 2025-01-02 15:06:48,679 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:41531/v1, this might take a moment... Attempt: 13/120INFO 2025-01-02 15:06:52,025 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:41531/v1, this might take a moment... Attempt: 14/120INFO 2025-01-02 15:06:55,317 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:41531/v1, this might take a moment... Attempt: 15/120INFO 2025-01-02 15:06:58,604 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:41531/v1, this might take a moment... 
Attempt: 16/120INFO 2025-01-02 15:07:01,927 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:41531/v1, this might take a moment... Attempt: 17/120INFO 2025-01-02 15:07:05,287 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:41531/v1, this might take a moment... Attempt: 18/120INFO 2025-01-02 15:07:08,763 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:41531/v1, this might take a moment... Attempt: 19/120INFO 2025-01-02 15:07:12,131 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:41531/v1, this might take a moment... Attempt: 20/120INFO 2025-01-02 15:07:15,476 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:41531/v1, this might take a moment... Attempt: 21/120INFO 2025-01-02 15:07:18,881 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:41531/v1, this might take a moment... Attempt: 22/120INFO 2025-01-02 15:07:22,203 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:41531/v1, this might take a moment... Attempt: 23/120INFO 2025-01-02 15:07:25,599 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:41531/v1, this might take a moment... Attempt: 24/120INFO 2025-01-02 15:07:28,991 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:41531/v1, this might take a moment... Attempt: 25/120INFO 2025-01-02 15:07:32,234 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:41531/v1, this might take a moment... Attempt: 26/120INFO 2025-01-02 15:07:35,714 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:41531/v1, this might take a moment... 
Attempt: 27/120INFO 2025-01-02 15:07:38,974 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:41531/v1, this might take a moment... Attempt: 28/120INFO 2025-01-02 15:07:42,265 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:41531/v1, this might take a moment... Attempt: 29/120INFO 2025-01-02 15:07:45,582 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:41531/v1, this might take a moment... Attempt: 30/120INFO 2025-01-02 15:07:45,586 instructlab.model.backends.vllm:136: vLLM engine successfully started at http://127.0.0.1:41531/v1╭────────────────────────────────────────────────────────────── system ──────────────────────────────────────────────────────────────╮│ Welcome to InstructLab Chat w/ SAMPLES_4435 (type /h for help)                                                                     │╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯&gt;&gt;&gt; Tell me everything you know about Swifties.                                                                           [S][default]╭─────────────────────────────────────────────────────────── samples_4435 ───────────────────────────────────────────────────────────╮│ Swifties are the fandom of the American singer-songwriter Taylor Swift.                                                            ││ Regarded by journalists as one of the largest, most devoted, and influential fan bases, Swifties are known for their high levels   ││ of participation, creativity, community, fanaticism, and cultural impact on the music industry and popular culture. They are a     ││ subject of widespread coverage in the mainstream media.                                                                            
││                                                                                                                                    ││ Critics have opined that Swift has redefined artist-fan relationships by establishing an intimate connection with Swifties. She    ││ has frequently engaged with, helped, credited, and prioritized her fans, who have offered unprecedented support and interest in    ││ her works irrespective of her wavering reception in the media. They continued to support Swift through her genre transitions,      ││ unexpected artistic pivots, and her highly publicized controversies such as the 2019 masters dispute, while instigating the        ││ political scrutiny of Ticketmaster that led to implementation of various laws and stimulated economic growth with the Eras Tour.   ││ Swift's releases, promotional efforts, and fashion have garnered attention for incorporating Easter eggs and clues that are        │......ConclusionsWe’ve demonstrated a simple InstructLab workflow that is scheduled by IBM LSF in a compute cluster equipped with GPUs.  As part of this example, LSF GPU scheduling and accounting for GPU workloads were highlighted. For organizations that want to productionize InstructLab and have a pool of GPU-equipped compute resources, LSF provides an ideal way to manage demand from a user community looking to run these intensive workloads. At the recent SC24 event, the demonstration went beyond what is shown in this blog. It incorporated single-click job submission via LSF Application Center, using a custom template created for InstructLab that submits both the synthetic data generation job and the training job. The demo environment was on IBM Cloud using instances equipped with Nvidia GPUs. The compute instances were automatically scaled up and down by the LSF resource connector. This will be the topic for a future blog.",
            "content_html": "<p><strong>Overview</strong></p><p>All the best for 2025! This blog looks back on a demo which I created for <a href=\"https://sc24.supercomputing.org\">SC24</a> last November to demonstrate InstructLab workflows running on an <a href=\"https://www.ibm.com/products/hpc-workload-management\">IBM LSF</a> cluster. Let’s begin with a bit of background. I’d like to thank Michael Spriggs, STSM, IBM LSF for his contributions to this blog.</p><p>When I think of tuning, what immediately comes to my mind are visions of an expert mechanic trying to extract the most from an engine. This blog is focused on an entirely different type of tuning, AI model tuning. Like tuning an engine, AI model tuning can be used to ensure a better fit for a given AI model for your business.</p><p>Released by IBM and Red Hat in May 2024, <a href=\"https://research.ibm.com/blog/instruct-lab\">InstructLab</a> is an open-source project which provides the ability to fine-tune LLMs by adding skills and knowledge, without having to retrain the model from scratch. InstructLab can run on resource-constrained systems such as laptops, but also supports GPUs. Much has been written about InstructLab and this blog is not intended to provide an in-depth look at InstructLab. Rather, the objective here is to demonstrate how InstructLab workloads can be distributed and managed in a high-performance computing cluster with GPUs using the IBM LSF workload scheduler. Recently, IBM published a paper describing the infrastructure used to train the Granite family of AI foundation models. The paper describes the Vela and Blue Vela environments in detail. In particular, the Blue Vela environment is built on a software stack using Red Hat Enterprise Linux, IBM LSF and Storage Scale. Learn more in the detailed paper <a href=\"https://arxiv.org/abs/2407.05467\">here</a>.</p><p>The demo workflow consists of two LSF jobs. The first job generates synthetic data, which is used to teach the LLM new skills or knowledge. 
The second job, which depends upon the successful completion of the first, is the training job, where the new skills or knowledge are incorporated into an existing base model. A simple LSF job dependency is used to ensure the training job only runs after the successful completion of the synthetic data generation step.</p><p>The environment used is equipped with Nvidia GPUs.  InstructLab jobs will be run with the options for GPU support, and the jobs will be submitted to LSF with the appropriate GPU scheduling directives. Furthermore, it is assumed that the users' $HOME directory is available on all hosts in the cluster. Note that I require neither root access, nor a user account that is an LSF administrator, to install and use InstructLab on the LSF cluster.</p><p><strong>Configuration</strong></p><p>The HPC cluster is configured as follows:</p><ul><li>Red Hat Enterprise Linux v8.8</li><li>IBM LSF v10.0.1.15</li><li>InstructLab v0.19.4</li><li>Miniforge v3 (24.9.0-0)</li><li>NVIDIA CUDA v12.6</li><li>Compute nodes are equipped with 8 x Nvidia H100 GPUs</li></ul><p><strong>Install InstructLab</strong></p><ol><li>Log in to a compute node in the LSF cluster equipped with GPUs. If ssh access is disabled to compute nodes, then submit an interactive LSF batch job. This job requests 8 GPUs on a single system and will set them to exclusive execution mode.</li></ol><div class=\"highlight\"><pre><code class=\"language-plaintext\">$ bsub -Is -R \"span[hosts=1]\" -gpu \"num=8:j_exclusive=yes\" bash</code></pre></div><ol start=\"2\"><li>Install and set up a Conda environment. This will enable you to install a self-contained Conda environment for your user account with the necessary Python version needed for InstructLab. Miniforge is installed in the default location and the option to update the user's shell profile to start the Conda environment is selected. 
We assume here a shared $HOME directory.</li></ol><div class=\"highlight\"><pre><code class=\"language-plaintext\">$ cd $HOME\n$ curl -L -O \"https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-$(uname)-$(uname -m).sh\"\n$ bash Miniforge3-$(uname)-$(uname -m).sh</code></pre></div><ol start=\"3\"><li>Before proceeding, you must log out and log back in to activate the environment. Next, a Conda environment is created with the name <em>my_env</em>. Here we’ll specify Python v3.11, which is a requirement for InstructLab.</li></ol><div class=\"highlight\"><pre><code class=\"language-plaintext\">conda create --name my_env -c anaconda python=3.11\nconda activate my_env</code></pre></div><ol start=\"4\"><li>Next, install InstructLab. Here, version 0.19.4 of InstructLab is specified. This was the version of InstructLab available in the timeframe preceding the SC24 event. Follow the installation steps in the official InstructLab documentation <a href=\"https://github.com/instructlab/instructlab?tab=readme-ov-file#-installing-ilab\">here</a>.</li></ol><div class=\"highlight\"><pre><code class=\"language-plaintext\">$ pip install instructlab==0.19.4</code></pre></div><ol start=\"5\"><li>Next, perform the installation of InstructLab with Nvidia CUDA support. This is required for InstructLab to utilize the GPUs. Without this step, InstructLab will run on the CPUs. 
Note that CUDA v12.6 is installed on the system and the variables set below reflect this.</li></ol><div class=\"highlight\"><pre><code class=\"language-plaintext\">$ export CMAKE_ARGS=\"-DLLAMA_CUBLAS=on -DCUDA_PATH=/usr/local/cuda-12.6 -DCUDAToolkit_ROOT=/usr/local/cuda-12.6 -DCUDAToolkit_INCLUDE_DIR=/usr/local/cuda-12/include -DCUDAToolkit_LIBRARY_DIR=/usr/local/cuda-12.6/lib64\"\n$ export PATH=/usr/local/cuda-12.6/bin:$PATH\n$ pip cache remove llama_cpp_python\n$ CMAKE_ARGS=\"-DLLAMA_CUDA=on -DLLAMA_NATIVE=off\" pip install 'instructlab[cuda]'\n$ pip install vllm@git+https://github.com/opendatahub-io/vllm@v0.6.2</code></pre></div><p><strong>Configure InstructLab</strong></p><ol><li>With the installation of InstructLab complete, the next step is to run the initialization. This will set up paths to models and the taxonomy repo, as well as the GPU configuration.</li></ol><div class=\"highlight\"><pre><code class=\"language-plaintext\">$ ilab config init</code></pre></div><ol start=\"2\"><li>By default InstructLab stores models, training checkpoints and other files within <em>~/.cache</em> and <em>~/.local/share/instructlab</em>. If you have limited storage capacity available in $HOME, then you may opt to disable training checkpoint files. This can be done by setting the following option in <em>~/.config/instructlab/config.yaml</em>.</li></ol><div class=\"highlight\"><pre><code class=\"language-plaintext\">train:\n  checkpoint_at_epoch: false</code></pre></div><ol start=\"3\"><li>Next, we download the required models. The ilab model list command can be used to list the models which are available. Note that a <a href=\"https://huggingface.co\">HuggingFace</a> token is required to download certain models. 
Please set HF_TOKEN in the environment with the appropriate token.</li></ol><div class=\"highlight\"><pre><code class=\"language-plaintext\">$ export HF_TOKEN=&lt;HuggingFace token&gt;\n$ ilab model download\n$ ilab model download --repository=instructlab/granite-7b-lab\n$ ilab model list\n+--------------------------------------+---------------------+---------+\n| Model Name                           | Last Modified       | Size    |\n+--------------------------------------+---------------------+---------+\n| instructlab/granite-7b-lab           | 2024-12-27 20:37:29 | 12.6 GB |\n| mistral-7b-instruct-v0.2.Q4_K_M.gguf | 2024-12-27 16:55:46 | 4.1 GB  |\n| merlinite-7b-lab-Q4_K_M.gguf         | 2024-12-27 16:48:39 | 4.1 GB  |\n+--------------------------------------+---------------------+---------+</code></pre></div><p><strong>Generate synthetic data &amp; AI model training</strong></p><p>Next is the synthetic data generation step, which will be executed on GPUs. This step is a prerequisite to teaching the LLM new skills/knowledge via training.</p><ol><li><p>Here we use example knowledge from the InstructLab GitHub about Taylor Swift fans, who are known as “Swifties”. This is timely because Taylor Swift recently wrapped up 6 concerts in Toronto, Canada, where I happen to be based. Copy attribution.txt and qna.yaml from the following <a href=\"https://github.com/mairin/taxonomy/tree/swifties/knowledge/arts/music/fandom/swifties\">location</a>.</p></li><li><p>By default, the InstructLab taxonomy is found in <em>~/.local/share/instructlab/taxonomy</em>. 
Here we create the directories fandom/swifties under <em>~/.local/share/instructlab/taxonomy/knowledge/arts</em> and copy the files from step 1 into this location.</p></li></ol><div class=\"highlight\"><pre><code class=\"language-plaintext\">$ mkdir -p ~/.local/share/instructlab/taxonomy/knowledge/arts/fandom/swifties\n$ cp &lt;path_to&gt;/attribution.txt ~/.local/share/instructlab/taxonomy/knowledge/arts/fandom/swifties\n$ cp &lt;path_to&gt;/qna.yaml ~/.local/share/instructlab/taxonomy/knowledge/arts/fandom/swifties</code></pre></div><ol start=\"3\"><li>With the Swifties taxonomy in place, check for any syntax errors with the command <em>ilab taxonomy diff</em>. It should report that the taxonomy is valid if there are no syntax errors.</li></ol><div class=\"highlight\"><pre><code class=\"language-plaintext\">$ ilab taxonomy diff\nknowledge/arts/fandom/swifties/qna.yaml\nTaxonomy in /u/gsamu/.local/share/instructlab/taxonomy is valid :)</code></pre></div><ol start=\"4\"><li>With the taxonomy in place and having confirmed that the syntax is valid, it’s now time to run the synthetic data generation job through LSF. Here we will request 8 GPUs on a single server in exclusive execution mode. For the InstructLab ilab command, specify the <em>--gpus 8</em> and <em>--pipeline full</em> options. Standard output is written to $HOME/job-output with the filename &lt;LSF_JOBID&gt;.out. 
The $HOME/job-output directory must already exist.</li></ol><div class=\"highlight\"><pre><code class=\"language-plaintext\">$ mkdir -p $HOME/job-output\n$ bsub -o $HOME/job-output/%J.out -R \"span[hosts=1]\" -gpu \"num=8:j_exclusive=yes\" ilab data generate --pipeline full --gpus 8\nJob &lt;1131&gt; is submitted to default queue &lt;normal&gt;.</code></pre></div><ol start=\"5\"><li>During job execution, the LSF <em>bpeek</em> command can be used to monitor the job standard output.</li></ol><div class=\"highlight\"><pre><code class=\"language-plaintext\">$ bpeek -f 1131 &lt;&lt; output from stdout &gt;&gt;INFO 2025-01-02 09:51:29,503 numexpr.utils:146: Note: detected 96 virtual cores but NumExpr set to maximum of 64, check \"NUMEXPR_MAX_THREADS\" environment variable.INFO 2025-01-02 09:51:29,504 numexpr.utils:149: Note: NumExpr detected 96 cores but \"NUMEXPR_MAX_THREADS\" not set, so enforcing safe limit of 16.INFO 2025-01-02 09:51:29,504 numexpr.utils:162: NumExpr defaulting to 16 threads.INFO 2025-01-02 09:51:30,038 datasets:59: PyTorch version 2.3.1 available.INFO 2025-01-02 09:51:31,226 instructlab.model.backends.llama_cpp:100: Trying to connect to model server at http://127.0.0.1:8000/v1WARNING 2025-01-02 09:51:56,356 instructlab.data.generate:270: Disabling SDG batching - unsupported with llama.cpp servingGenerating synthetic data using 'full' pipeline, '/u/gsamu/.cache/instructlab/models/mistral-7b-instruct-v0.2.Q4_K_M.gguf' model, '/u/gsamu/.local/share/instructlab/taxonomy' taxonomy, against http://127.0.0.1:55779/v1 serverINFO 2025-01-02 09:51:56,861 instructlab.sdg.generate_data:356: Synthesizing new instructions. If you aren't satisfied with the generated instructions, interrupt training (Ctrl-C) and try adjusting your YAML files. 
Adding more examples may help.INFO 2025-01-02 09:51:56,872 instructlab.sdg.pipeline:153: Running pipeline single-threadedINFO 2025-01-02 09:51:56,872 instructlab.sdg.pipeline:197: Running block: duplicate_document_colINFO 2025-01-02 09:51:56,872 instructlab.sdg.pipeline:198: Dataset({    features: ['icl_document', 'document', 'document_outline', 'domain', 'icl_query_1', 'icl_query_2', 'icl_query_3', 'icl_response_1', 'icl_response_2', 'icl_response_3'],    num_rows: 35})INFO 2025-01-02 09:51:58,286 instructlab.sdg.llmblock:51: LLM server supports batched inputs: FalseINFO 2025-01-02 09:51:58,286 instructlab.sdg.pipeline:197: Running block: gen_spellcheckINFO 2025-01-02 09:51:58,286 instructlab.sdg.pipeline:198: Dataset({    features: ['icl_document', 'document', 'document_outline', 'domain', 'icl_query_1', 'icl_query_2', 'icl_query_3', 'icl_response_1', 'icl_response_2', 'icl_response_3', 'base_document'],    num_rows: 35})/u/gsamu/miniforge3/envs/my_env/lib/python3.11/site-packages/llama_cpp/llama.py:1054: RuntimeWarning: Detected duplicate leading \"&lt;s&gt;\" in prompt, this will likely reduce response quality, consider removing it...  
warnings.warn(INFO 2025-01-02 09:57:42,264 instructlab.sdg.pipeline:197: Running block: flatten_auxiliary_columnsINFO 2025-01-02 09:57:42,264 instructlab.sdg.pipeline:198: Dataset({    features: ['icl_document', 'document', 'document_outline', 'domain', 'icl_query_1', 'icl_query_2', 'icl_query_3', 'icl_response_1', 'icl_response_2', 'icl_response_3', 'base_document', 'spellcheck'],    num_rows: 35})INFO 2025-01-02 09:57:42,279 instructlab.sdg.pipeline:197: Running block: rename_to_document_columnINFO 2025-01-02 09:57:42,279 instructlab.sdg.pipeline:198: Dataset({    features: ['icl_document', 'document', 'document_outline', 'domain', 'icl_query_1', 'icl_query_2', 'icl_query_3', 'icl_response_1', 'icl_response_2', 'icl_response_3', 'dataset_type', 'corrected_document'],    num_rows: 70})INFO 2025-01-02 09:57:42,282 instructlab.sdg.pipeline:197: Running block: gen_knowledgeINFO 2025-01-02 09:57:42,282 instructlab.sdg.pipeline:198: Dataset({    features: ['icl_document', 'raw_document', 'document_outline', 'domain', 'icl_query_1', 'icl_query_2', 'icl_query_3', 'icl_response_1', 'icl_response_2', 'icl_response_3', 'dataset_type', 'document'],    num_rows: 70})……</code></pre></div><ol start=\"6\"><li>During the runtime of the job, it’s possible to view GPU-related metrics using the LSF <em>lsload</em> and <em>bhosts</em> commands. First, we need to identify the host to which the job has been dispatched using the LSF bjobs command. In this case the job was dispatched to host <em>p1-r01-n4</em>. 
Note that detailed GPU accounting metrics are available once the job runs to completion.</li></ol><div class=\"highlight\"><pre><code class=\"language-plaintext\">$ bjobs -w\nJOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME\n1131    gsamu   RUN   normal     rmf-login-1 p1-r01-n4   ilab data generate --pipeline full --gpus 8 Jan  2 14:51\n$ lsload -w -gpu p1-r01-n4\nHOST_NAME                 status ngpus gpu_shared_avg_mut gpu_shared_avg_ut ngpus_physical\np1-r01-n4                     ok     8                 2%                7%              8\n$ bhosts -w -gpu p1-r01-n4\nHOST_NAME            GPU_ID                MODEL     MUSED      MRSV  NJOBS    RUN   SUSP    RSV \np1-r01-n4                 0   NVIDIAH10080GBHBM3        2G        0G      1      1      0      0\n                          1   NVIDIAH10080GBHBM3        2G        0G      1      1      0      0\n                          2   NVIDIAH10080GBHBM3        2G        0G      1      1      0      0\n                          3   NVIDIAH10080GBHBM3        2G        0G      1      1      0      0\n                          4   NVIDIAH10080GBHBM3        2G        0G      1      1      0      0\n                          5   NVIDIAH10080GBHBM3        2G        0G      1      1      0      0\n                          6   NVIDIAH10080GBHBM3        2G        0G      1      1      0      0\n                          7   NVIDIAH10080GBHBM3        2G        0G      1      1      0      0</code></pre></div><ol start=\"7\"><li>After job completion, it’s possible to view details about the job, including GPU utilization, which LSF collects by leveraging NVIDIA DCGM. 
These metrics are available upon job completion using both the LSF <em>bhist</em> and <em>bjobs</em> commands.</li></ol><div class=\"highlight\"><pre><code class=\"language-plaintext\">$ bhist -l -gpu 1131Job &lt;1131&gt;, User &lt;gsamu&gt;, Project &lt;default&gt;, Command &lt;ilab data generate --pipe                          line full --gpus 8&gt;Thu Jan  2 14:51:23 2025: Submitted from host &lt;rmf-login-1&gt;, to Queue &lt;normal&gt;,                           CWD &lt;$HOME&gt;, Output File &lt;/u/gsamu/job-output/%J.out                          &gt;, Requested Resources &lt;span[hosts=1]&gt;, Requested GPU                           &lt;num=8:j_exclusive=yes&gt;;Thu Jan  2 14:51:24 2025: Dispatched 1 Task(s) on Host(s) &lt;p1-r01-n4&gt;, Allocate                          d 1 Slot(s) on Host(s) &lt;p1-r01-n4&gt;, Effective RES_REQ                           &lt;select[((ngpus&gt;0)) &amp;&amp; (type == local)] order[r15s:p                          g] rusage[ngpus_physical=8.00] span[hosts=1] &gt;;Thu Jan  2 14:51:25 2025: Starting (Pid 3095851);Thu Jan  2 14:51:25 2025: External Message \"p1-r01-n4:gpus=0,1,2,3,4,5,6,7;EFFE                          CTIVE GPU REQ: num=8:mode=shared:mps=no:j_exclusive=y                          es:gvendor=nvidia;\" was posted from \"gsamu\" to messag                          e box 0;Thu Jan  2 14:51:26 2025: Running with execution home &lt;/u/gsamu&gt;, Execution CWD                           &lt;/u/gsamu&gt;, Execution Pid &lt;3095851&gt;;Thu Jan  2 16:08:05 2025: Done successfully. 
The CPU time used is 4624.0 second                          s;                          HOST: p1-r01-n4; CPU_TIME: 4624 seconds                                                        GPU ID: 0                                  Total Execution Time: 4597 seconds                                  Energy Consumed: 579704 Joules                                  SM Utilization (%): Avg 9, Max 15, Min 0                                  Memory Utilization (%): Avg 2, Max 100, Min 0                                  Max GPU Memory Used: 1956642816 bytes                              GPU ID: 1                                  Total Execution Time: 4597 seconds                                  Energy Consumed: 503956 Joules                                  SM Utilization (%): Avg 7, Max 11, Min 0                                  Memory Utilization (%): Avg 2, Max 5, Min 0                                  Max GPU Memory Used: 1767899136 bytes                              GPU ID: 2                                  Total Execution Time: 4597 seconds                                  Energy Consumed: 501754 Joules                                  SM Utilization (%): Avg 7, Max 11, Min 0                                  Memory Utilization (%): Avg 2, Max 5, Min 0                                  Max GPU Memory Used: 1784676352 bytes                              GPU ID: 3                                  Total Execution Time: 4597 seconds                                  Energy Consumed: 525195 Joules                                  SM Utilization (%): Avg 7, Max 11, Min 0                                  Memory Utilization (%): Avg 2, Max 54, Min 0                                  Max GPU Memory Used: 1767899136 bytes                              GPU ID: 4                                  Total Execution Time: 4597 seconds                                  Energy Consumed: 525331 Joules                                  SM Utilization (%): Avg 7, Max 12, Min 0                           
       Memory Utilization (%): Avg 2, Max 5, Min 0                                  Max GPU Memory Used: 1767899136 bytes                              GPU ID: 5                                  Total Execution Time: 4597 seconds                                  Energy Consumed: 502416 Joules                                  SM Utilization (%): Avg 7, Max 11, Min 0                                  Memory Utilization (%): Avg 2, Max 5, Min 0                                  Max GPU Memory Used: 1784676352 bytes                              GPU ID: 6                                  Total Execution Time: 4597 seconds                                  Energy Consumed: 508720 Joules                                  SM Utilization (%): Avg 7, Max 12, Min 0                                  Memory Utilization (%): Avg 2, Max 5, Min 0                                  Max GPU Memory Used: 1784676352 bytes                              GPU ID: 7                                  Total Execution Time: 4597 seconds                                  Energy Consumed: 491041 Joules                                  SM Utilization (%): Avg 6, Max 12, Min 0                                  Memory Utilization (%): Avg 2, Max 4, Min 0                                  Max GPU Memory Used: 1933574144 bytesGPU Energy Consumed: 4138117.000000 JoulesThu Jan  2 16:08:05 2025: Post job process done successfully;GPU_ALLOCATION: HOST             TASK GPU_ID  GI_PLACEMENT/SIZE    CI_PLACEMENT/SIZE    MODEL        MTOTAL  FACTOR MRSV    SOCKET NVLINK/XGMI                       p1-r01-n4        0    0       -                    -                    NVIDIAH10080 80G     9.0    0G      0      -                                                 0    1       -                    -                    NVIDIAH10080 80G     9.0    0G      0      -                                                 0    2       -                    -                    NVIDIAH10080 80G     9.0    0G      0      -                    
                             0    3       -                    -                    NVIDIAH10080 80G     9.0    0G      0      -                                                 0    4       -                    -                    NVIDIAH10080 80G     9.0    0G      1      -                                                 0    5       -                    -                    NVIDIAH10080 80G     9.0    0G      1      -                                                 0    6       -                    -                    NVIDIAH10080 80G     9.0    0G      1      -                                                 0    7       -                    -                    NVIDIAH10080 80G     9.0    0G      1      -                               MEMORY USAGE:MAX MEM: 2 Gbytes;  AVG MEM: 1 Gbytes; MEM Efficiency: 0.00%CPU USAGE:CPU PEAK: 1.69 ;  CPU PEAK DURATION: 52 second(s)CPU AVERAGE EFFICIENCY: 100.69% ;  CPU PEAK EFFICIENCY: 169.23%Summary of time in seconds spent in various states by  Thu Jan  2 16:08:05 2025  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL  1        0        4601     0        0        0        4602 </code></pre></div><ol start=\"8\"><li><p>When the synthetic data generation job completes, its output can be viewed at <em>~/job-output/<!-- raw HTML omitted -->.out</em>. The synthetic data sets consist of files in the directory <em>~/.local/share/instructlab/datasets</em>. These files will be named <em>skills_train_msgs_*.jsonl</em> and <em>knowledge_train_msgs_*.jsonl</em>.</p></li><li><p>With the synthetic data generation step complete, it’s now time to run the training. 
We first set two environment variables to point to the following files: <em>~/.local/share/instructlab/datasets/knowledge_train_msgs_2025-01-02T09_51_56.jsonl</em> and <em>~/.local/share/instructlab/datasets/skills_train_msgs_2025-01-02T09_51_56.jsonl</em>.</p></li></ol><p>Afterward, we submit the training job to LSF, requesting 8 GPUs, with the ilab options <em>--pipeline accelerated</em>, <em>--gpus 8</em>, <em>--device cuda</em> and <em>--data-path</em> pointing to the two data files above, which were produced in the synthetic data generation step.</p><div class=\"highlight\"><pre><code class=\"language-plaintext\">$ export SKILLS_PATH=/u/gsamu/.local/share/instructlab/datasets/skills_train_msgs_2025-01-02T09_51_56.jsonl$ export KNOWLEDGE_PATH=/u/gsamu/.local/share/instructlab/datasets/knowledge_train_msgs_2025-01-02T09_51_56.jsonl$ bsub -o $HOME/job-output/%J.out -R \"span[hosts=1]\" -gpu \"num=8:j_exclusive=yes\" ilab model train --pipeline accelerated --data-path $SKILLS_PATH --data-path $KNOWLEDGE_PATH --device cuda --gpus 8Job &lt;1135&gt; is submitted to default queue &lt;normal&gt;.</code></pre></div><ol start=\"10\"><li>During job execution, the LSF <em>bpeek</em> command can be used to monitor the job’s standard output.</li></ol><div class=\"highlight\"><pre><code class=\"language-plaintext\">$ bpeek -f 1135&lt;&lt; output from stdout &gt;&gt;LoRA is disabled (rank=0), ignoring all additional LoRA args[2025-01-02 12:52:04,359] [INFO] [real_accelerator.py:222:get_accelerator] Setting ds_accelerator to cuda (auto detect)INFO 2025-01-02 12:52:09,061 numexpr.utils:146: Note: detected 96 virtual cores but NumExpr set to maximum of 64, check \"NUMEXPR_MAX_THREADS\" environment variable.INFO 2025-01-02 12:52:09,061 numexpr.utils:149: Note: NumExpr detected 96 cores but \"NUMEXPR_MAX_THREADS\" not set, so enforcing safe limit of 16.INFO 2025-01-02 12:52:09,061 numexpr.utils:162: NumExpr defaulting to 16 threads.INFO 2025-01-02 12:52:09,304 
datasets:59: PyTorch version 2.3.1 available.You are using the default legacy behaviour of the &lt;class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'&gt;. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.INFO 2025-01-02 12:52:09,653 root:617: Special tokens: eos: [32000], pad: [32001], bos: [32005], system: [32004], user: [32002], assistant: [32003]INFO 2025-01-02 12:52:09,923 root:617: number of dropped samples: 0 -- out of 641 data arguments are:{\"data_path\":\"/u/gsamu/.local/share/instructlab/datasets/knowledge_train_msgs_2025-01-02T09_51_56.jsonl\",\"data_output_path\":\"/u/gsamu/.local/share/instructlab/internal\",\"max_seq_len\":4096,\"model_path\":\"/u/gsamu/.cache/instructlab/models/instructlab/granite-7b-lab\",\"chat_tmpl_path\":\"/u/gsamu/miniforge3/envs/my_env/lib/python3.11/site-packages/instructlab/training/chat_templates/ibm_generic_tmpl.py\",\"num_cpu_procs\":16}tokenizing the dataset with /u/gsamu/.cache/instructlab/models/instructlab/granite-7b-lab tokenizer...ten largest length percentiles:quantile 90th: 1459.0quantile 91th: 1466.0quantile 92th: 1469.6000000000001quantile 93th: 1478.2quantile 94th: 1483.0quantile 95th: 1488.0quantile 96th: 1497.1999999999998quantile 97th: 1516.5999999999997quantile 98th: 1540.6000000000001quantile 99th: 1656.0000000000016quantile 100th: 2578.0at 4096 max sequence length, the number of samples to be dropped is 0(0.00% of total)quantile 0th: 368.0quantile 1th: 393.0quantile 2th: 411.2quantile 3th: 421.2quantile 4th: 427.2quantile 5th: 442.0quantile 6th: 604.4quantile 7th: 631.8quantile 8th: 653.8000000000001quantile 
9th: 679.8quantile 10th: 742.0at 20 min sequence length, the number of samples to be dropped is 0checking the validity of the samples...Categorizing training data type...unmasking the appropriate message content... Samples Previews...……</code></pre></div><ol start=\"11\"><li>During the runtime of the training job, we can observe some GPU utilization information using the LSF <em>lsload</em> and <em>bhosts</em> commands. First we need to identify the server on which the training job is running. This is done using the <em>bjobs</em> command and checking for the execution host of the job.</li></ol><div class=\"highlight\"><pre><code class=\"language-plaintext\">$ bjobs -wJOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME1135    gsamu   RUN   normal     rmf-login-1 p1-r01-n1   ilab model train --pipeline accelerated --data-path /u/gsamu/.local/share/instructlab/datasets/skills_train_msgs_2025-01-02T09_51_56.jsonl --data-path /u/gsamu/.local/share/instructlab/datasets/knowledge_train_msgs_2025-01-02T09_51_56.jsonl --device cuda --gpus 8 Jan  2 17:51$ lsload -w -gpu p1-r01-n1HOST_NAME                 status ngpus gpu_shared_avg_mut gpu_shared_avg_ut ngpus_physicalp1-r01-n1                     ok     8                 0%               22%              8$ bhosts -w -gpu p1-r01-n1HOST_NAME            GPU_ID                MODEL     MUSED      MRSV  NJOBS    RUN   SUSP    RSV p1-r01-n1                 0   NVIDIAH10080GBHBM3       10G        0G      1      1      0      0                          1   NVIDIAH10080GBHBM3       10G        0G      1      1      0      0                          2   NVIDIAH10080GBHBM3       10G        0G      1      1      0      0                          3   NVIDIAH10080GBHBM3       10G        0G      1      1      0      0                          4   NVIDIAH10080GBHBM3       10G        0G      1      1      0      0                          5   NVIDIAH10080GBHBM3       10G        0G      1      1      0      0                          6   
NVIDIAH10080GBHBM3       10G        0G      1      1      0      0                          7   NVIDIAH10080GBHBM3       10G        0G      1      1      0      0</code></pre></div><ol start=\"12\"><li>Once the job is complete, detailed GPU accounting can again be viewed using the LSF <em>bhist</em> command, as shown below.</li></ol><div class=\"highlight\"><pre><code class=\"language-plaintext\">$ bhist -l -gpu 1135Job &lt;1135&gt;, User &lt;gsamu&gt;, Project &lt;default&gt;, Command &lt;ilab model train --pipeli                          ne accelerated --data-path /u/gsamu/.local/share/inst                          ructlab/datasets/skills_train_msgs_2025-01-02T09_51_5                          6.jsonl --data-path /u/gsamu/.local/share/instructlab                          /datasets/knowledge_train_msgs_2025-01-02T09_51_56.js                          onl --device cuda --gpus 8&gt;Thu Jan  2 17:51:48 2025: Submitted from host &lt;rmf-login-1&gt;, to Queue &lt;normal&gt;,                           CWD &lt;$HOME/.local/share/instructlab/checkpoints&gt;, Ou                          tput File &lt;/u/gsamu/job-output/%J.out&gt;, Requested Res                          ources &lt;span[hosts=1]&gt;, Requested GPU &lt;num=8:j_exclus                          ive=yes&gt;;Thu Jan  2 17:51:48 2025: Dispatched 1 Task(s) on Host(s) &lt;p1-r01-n1&gt;, Allocate                          d 1 Slot(s) on Host(s) &lt;p1-r01-n1&gt;, Effective RES_REQ                           &lt;select[((ngpus&gt;0)) &amp;&amp; (type == local)] order[r15s:p                          g] rusage[ngpus_physical=8.00] span[hosts=1] &gt;;Thu Jan  2 17:51:49 2025: Starting (Pid 3462241);Thu Jan  2 17:51:49 2025: Running with execution home &lt;/u/gsamu&gt;, Execution CWD                           &lt;/u/gsamu/.local/share/instructlab/checkpoints&gt;, Exe                          cution Pid &lt;3462241&gt;;Thu Jan  2 17:51:49 2025: External Message \"p1-r01-n1:gpus=0,1,2,3,4,5,6,7;EFFE                          
CTIVE GPU REQ: num=8:mode=shared:mps=no:j_exclusive=y                          es:gvendor=nvidia;\" was posted from \"gsamu\" to messag                          e box 0;Thu Jan  2 17:57:56 2025: Done successfully. The CPU time used is 3024.0 second                          s;                          HOST: p1-r01-n1; CPU_TIME: 3024 seconds                                                        GPU ID: 0                                  Total Execution Time: 365 seconds                                  Energy Consumed: 98890 Joules                                  SM Utilization (%): Avg 20, Max 100, Min 0                                  Memory Utilization (%): Avg 9, Max 62, Min 0                                  Max GPU Memory Used: 53022294016 bytes                              GPU ID: 1                                  Total Execution Time: 365 seconds                                  Energy Consumed: 97697 Joules                                  SM Utilization (%): Avg 53, Max 100, Min 0                                  Memory Utilization (%): Avg 9, Max 58, Min 0                                  Max GPU Memory Used: 53087305728 bytes                              GPU ID: 2                                  Total Execution Time: 365 seconds                                  Energy Consumed: 94820 Joules                                  SM Utilization (%): Avg 53, Max 100, Min 0                                  Memory Utilization (%): Avg 9, Max 62, Min 0                                  Max GPU Memory Used: 53221523456 bytes                              GPU ID: 3                                  Total Execution Time: 365 seconds                                  Energy Consumed: 98014 Joules                                  SM Utilization (%): Avg 53, Max 100, Min 0                                  Memory Utilization (%): Avg 9, Max 59, Min 0                                  Max GPU Memory Used: 53041168384 bytes                              GPU ID: 4              
                    Total Execution Time: 365 seconds                                  Energy Consumed: 99246 Joules                                  SM Utilization (%): Avg 53, Max 100, Min 0                                  Memory Utilization (%): Avg 9, Max 60, Min 0                                  Max GPU Memory Used: 53045362688 bytes                              GPU ID: 5                                  Total Execution Time: 365 seconds                                  Energy Consumed: 94952 Joules                                  SM Utilization (%): Avg 53, Max 100, Min 0                                  Memory Utilization (%): Avg 9, Max 65, Min 0                                  Max GPU Memory Used: 53047459840 bytes                              GPU ID: 6                                  Total Execution Time: 365 seconds                                  Energy Consumed: 98227 Joules                                  SM Utilization (%): Avg 53, Max 100, Min 0                                  Memory Utilization (%): Avg 9, Max 63, Min 0                                  Max GPU Memory Used: 53127151616 bytes                              GPU ID: 7                                  Total Execution Time: 365 seconds                                  Energy Consumed: 94582 Joules                                  SM Utilization (%): Avg 52, Max 100, Min 0                                  Memory Utilization (%): Avg 9, Max 65, Min 0                                  Max GPU Memory Used: 53481570304 bytesGPU Energy Consumed: 776428.000000 JoulesThu Jan  2 17:57:56 2025: Post job process done successfully;GPU_ALLOCATION: HOST             TASK GPU_ID  GI_PLACEMENT/SIZE    CI_PLACEMENT/SIZE    MODEL        MTOTAL  FACTOR MRSV    SOCKET NVLINK/XGMI                       p1-r01-n1        0    0       -                    -                    NVIDIAH10080 80G     9.0    0G      0      -                                                 0    1       -                    -      
              NVIDIAH10080 80G     9.0    0G      0      -                                                 0    2       -                    -                    NVIDIAH10080 80G     9.0    0G      0      -                                                 0    3       -                    -                    NVIDIAH10080 80G     9.0    0G      0      -                                                 0    4       -                    -                    NVIDIAH10080 80G     9.0    0G      1      -                                                 0    5       -                    -                    NVIDIAH10080 80G     9.0    0G      1      -                                                 0    6       -                    -                    NVIDIAH10080 80G     9.0    0G      1      -                                                 0    7       -                    -                    NVIDIAH10080 80G     9.0    0G      1      -                               MEMORY USAGE:MAX MEM: 104 Gbytes;  AVG MEM: 16 Gbytes; MEM Efficiency: 0.00%CPU USAGE:CPU PEAK: 17.86 ;  CPU PEAK DURATION: 49 second(s)CPU AVERAGE EFFICIENCY: 856.60% ;  CPU PEAK EFFICIENCY: 1785.71%Summary of time in seconds spent in various states by  Thu Jan  2 17:57:56 2025  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL  0        0        368      0        0        0        368         </code></pre></div><ol start=\"13\"><li>Finally, with the model successfully trained, let’s chat with the new model to check the result. Here we’ll pose it Swiftie-specific questions. Note that the output from the training is written to <em>~/.local/share/instructlab/checkpoints/hf_format</em>. We’ll take the model from the latest checkpoint directory that was created. Here again, we launch the model chat job via LSF as an interactive batch job (i.e. 
<em>bsub -Is</em>).</li></ol><div class=\"highlight\"><pre><code class=\"language-plaintext\">$ grep hf_format 1135.outModel saved in /u/gsamu/.local/share/instructlab/checkpoints/hf_format/samples_886Model saved in /u/gsamu/.local/share/instructlab/checkpoints/hf_format/samples_1776Model saved in /u/gsamu/.local/share/instructlab/checkpoints/hf_format/samples_2658Model saved in /u/gsamu/.local/share/instructlab/checkpoints/hf_format/samples_3546Model saved in /u/gsamu/.local/share/instructlab/checkpoints/hf_format/samples_4435</code></pre></div><div class=\"highlight\"><pre><code class=\"language-plaintext\">$ bsub -Is -R \"span[hosts=1]\" -gpu \"num=8:j_exclusive=yes\" ilab model chat --model /u/gsamu/.local/share/instructlab/checkpoints/hf_format/samples_4435Job &lt;1146&gt; is submitted to default queue &lt;interactive&gt;.&lt;&lt;Waiting for dispatch ...&gt;&gt;&lt;&lt;Starting on p1-r01-n2&gt;&gt;INFO 2025-01-02 15:06:07,600 instructlab.model.backends.vllm:105: Trying to connect to model server at http://127.0.0.1:8000/v1INFO 2025-01-02 15:06:08,876 instructlab.model.backends.vllm:308: vLLM starting up on pid 3744375 at http://127.0.0.1:41531/v1INFO 2025-01-02 15:06:08,876 instructlab.model.backends.vllm:114: Starting a temporary vLLM server at http://127.0.0.1:41531/v1INFO 2025-01-02 15:06:08,876 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:41531/v1, this might take a moment... Attempt: 1/120INFO 2025-01-02 15:06:12,244 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:41531/v1, this might take a moment... Attempt: 2/120INFO 2025-01-02 15:06:15,614 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:41531/v1, this might take a moment... Attempt: 3/120INFO 2025-01-02 15:06:18,801 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:41531/v1, this might take a moment... 
Attempt: 4/120INFO 2025-01-02 15:06:21,952 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:41531/v1, this might take a moment... Attempt: 5/120INFO 2025-01-02 15:06:25,391 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:41531/v1, this might take a moment... Attempt: 6/120INFO 2025-01-02 15:06:28,638 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:41531/v1, this might take a moment... Attempt: 7/120INFO 2025-01-02 15:06:32,103 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:41531/v1, this might take a moment... Attempt: 8/120INFO 2025-01-02 15:06:35,296 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:41531/v1, this might take a moment... Attempt: 9/120INFO 2025-01-02 15:06:38,616 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:41531/v1, this might take a moment... Attempt: 10/120INFO 2025-01-02 15:06:42,015 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:41531/v1, this might take a moment... Attempt: 11/120INFO 2025-01-02 15:06:45,435 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:41531/v1, this might take a moment... Attempt: 12/120INFO 2025-01-02 15:06:48,679 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:41531/v1, this might take a moment... Attempt: 13/120INFO 2025-01-02 15:06:52,025 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:41531/v1, this might take a moment... Attempt: 14/120INFO 2025-01-02 15:06:55,317 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:41531/v1, this might take a moment... 
Attempt: 15/120INFO 2025-01-02 15:06:58,604 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:41531/v1, this might take a moment... Attempt: 16/120INFO 2025-01-02 15:07:01,927 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:41531/v1, this might take a moment... Attempt: 17/120INFO 2025-01-02 15:07:05,287 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:41531/v1, this might take a moment... Attempt: 18/120INFO 2025-01-02 15:07:08,763 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:41531/v1, this might take a moment... Attempt: 19/120INFO 2025-01-02 15:07:12,131 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:41531/v1, this might take a moment... Attempt: 20/120INFO 2025-01-02 15:07:15,476 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:41531/v1, this might take a moment... Attempt: 21/120INFO 2025-01-02 15:07:18,881 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:41531/v1, this might take a moment... Attempt: 22/120INFO 2025-01-02 15:07:22,203 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:41531/v1, this might take a moment... Attempt: 23/120INFO 2025-01-02 15:07:25,599 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:41531/v1, this might take a moment... Attempt: 24/120INFO 2025-01-02 15:07:28,991 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:41531/v1, this might take a moment... Attempt: 25/120INFO 2025-01-02 15:07:32,234 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:41531/v1, this might take a moment... 
Attempt: 26/120INFO 2025-01-02 15:07:35,714 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:41531/v1, this might take a moment... Attempt: 27/120INFO 2025-01-02 15:07:38,974 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:41531/v1, this might take a moment... Attempt: 28/120INFO 2025-01-02 15:07:42,265 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:41531/v1, this might take a moment... Attempt: 29/120INFO 2025-01-02 15:07:45,582 instructlab.model.backends.vllm:129: Waiting for the vLLM server to start at http://127.0.0.1:41531/v1, this might take a moment... Attempt: 30/120INFO 2025-01-02 15:07:45,586 instructlab.model.backends.vllm:136: vLLM engine successfully started at http://127.0.0.1:41531/v1╭────────────────────────────────────────────────────────────── system ──────────────────────────────────────────────────────────────╮│ Welcome to InstructLab Chat w/ SAMPLES_4435 (type /h for help)                                                                     │╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯&gt;&gt;&gt; Tell me everything you know about Swifties.                                                                           [S][default]╭─────────────────────────────────────────────────────────── samples_4435 ───────────────────────────────────────────────────────────╮│ Swifties are the fandom of the American singer-songwriter Taylor Swift.                                                            ││ Regarded by journalists as one of the largest, most devoted, and influential fan bases, Swifties are known for their high levels   ││ of participation, creativity, community, fanaticism, and cultural impact on the music industry and popular culture. They are a     ││ subject of widespread coverage in the mainstream media.                
                                                            ││                                                                                                                                    ││ Critics have opined that Swift has redefined artist-fan relationships by establishing an intimate connection with Swifties. She    ││ has frequently engaged with, helped, credited, and prioritized her fans, who have offered unprecedented support and interest in    ││ her works irrespective of her wavering reception in the media. They continued to support Swift through her genre transitions,      ││ unexpected artistic pivots, and her highly publicized controversies such as the 2019 masters dispute, while instigating the        ││ political scrutiny of Ticketmaster that led to implementation of various laws and stimulated economic growth with the Eras Tour.   ││ Swift's releases, promotional efforts, and fashion have garnered attention for incorporating Easter eggs and clues that are        │......</code></pre></div><p><strong>Conclusions</strong></p><p>We’ve demonstrated a simple InstructLab workflow that is scheduled by IBM LSF in a compute cluster equipped with GPUs. As part of this example, LSF GPU scheduling and accounting for GPU workloads were highlighted. For organizations that want to productionize InstructLab and have a pool of GPU-equipped compute resources, LSF provides an ideal way to manage demand from a user community looking to run these intensive workloads.</p><p>At the recent SC24 event, the demonstration went beyond what is shown in this blog. It incorporated single-click job submission via LSF Application Center, using a custom template created for InstructLab that submits both the synthetic data generation job and the training job. The demo environment was on IBM Cloud using instances equipped with NVIDIA GPUs. The compute instances were automatically scaled up and down by the LSF resource connector. 
This will be the topic for a future blog.</p>",
            "url": "https://hpc.social/personal-blog/2025/fine-tuning-ai-models-with-instructlab-under-ibm-lsf/",
            
            
            
            
            
            "date_published": "2025-01-06T19:36:24-07:00",
            "date_modified": "2025-01-06T19:36:24-07:00",
            
                "author": "Ramblings of a supercomputing enthusiast."
            
        },
    
        {
            "id": "https://hpc.social/personal-blog/2025/surfing-the-singularity-please-hold-for-the-next-available-agent/",
            "title": "Surfing the Singularity - \"Please hold for the next available agent\"",
            "summary": null,
            "content_text": "&lt;div style=\"text-align: left;\"&gt;&lt;/div&gt;Our imaginations, having been so stimulated by the \"innovation trigger\" of early interactions with ChatGPT and its LLM kin, having experienced the illusion of the algorithm reading your mind, we have now firmly entered into the period of inflated expectations. Any day now we expect a knock on the door to be informed by some HAL Junior that not only are we now out of a job, we've also got 20 minutes to evacuate the premise before its bulldozed to make way for another solar farm and data center. AGI is only just one product announcement away, or maybe two, but certainly three at most... Nose Deep There is a strong desire on the part of companies trafficking in AI to generate not just chatbot hallucinations but also customers for real business use cases, meaning revenue, and now. To do that we're going to need hardware, fast, lots of it, and gigajoules to power it. So AWS buys a new data center in PA adjacent to a 2.5GW nuclear power plant.[1] Not to be outdone Microsoft re-revs up Three Mile Island (albeit with a catchy rebranding laughable by 1970's standards), with 100% of the power going to their regional AI data centers.[2] Three Mile Island nuclear power plant, aka the \"Crane Clean Energy Center\".  After vigorous expectations the trough of disillusionment will soon follow. Already Microsoft hints that demand for AI-oriented chips is waning.[3] Practical, as you'll have a hard time getting them anyway - the data-center grade GPU chips on which AI computation rely are in short supply - NVIDIA via their TSMC outsource manufacturing partner is fully booked for Blackwell GPU orders for the next 12 months.[4] AWS has recently announced to customers (like me) new limitations on availability of certain NVIDIA GPU instances. (Consider also that AI competes with crypto for these scarce GPUs.) 
Intel suggests it will ship mass quantities of chips for AI-ready PCs and other mobile devices in 2025, but the stock traders are not yet buying it, with the stock currently down over 50% year-over-year. In the end, and as evidenced by the long term investments, we of course expect the march of techno-progress to continue, but in the short run, aligning expectations with reality may remain a challenge. The August 2024 Gartner Hype Cycle for Emerging Technologies. Generative AI - weee! [5] What does OpenAI say about all this? First, the desire to be non-profit has bumped up against the realities of scaling up the models. Will they continue to scale up, yielding better and deeper performance on the road to artificial general intelligence simply by scaling up, or will they hit a theoretical wall?[6] Sam Altman says succinctly: \"there is no wall\".[7] The nuclear-powered race is on, be it sustainable or not. \"Your wait time is now less than...\" But as we argued in the last blog [8], we don't need dystopia-inducing super-human AGI in order to make productive and disruptive use of artificial intelligence technologies - a domain-tuned artificial capable intelligence (ACI) is enough.[9] Or a collaborating set of them. OpenAI's strategic product roadmap is more than a little vague [10], but in theory after chatbots capable of basic reasoning comes the age of agents - think: allowing Alexa to auto-restock your pantry via a hotline to Bezos when it overhears you say you're low on sugar. Such \"AI\" does such a good job doing basic things like, oh I dunno, controlling the lights in your home now, what could go wrong?! 
Truth is, today's LLMs perform only so-so on standardized benchmarks, and while they improve all the time [11], the current state of the art is not yet ready to be trusted and at times seems like snake oil.[12]Today's agents tend to be domain-specific and tailored to narrow purpose - Salesforce.com agents for common customer interactions, ServiceNow agents helping the human agent perform repetitive or summary tasks in handling case loads, but not replacing the human.[13,14] Google Gemini can add events to your calendar, help you plan travel, but is not yet trusted to actually borrow your credit card and book it. Keeping the human-in-the-loop will remain for now, as a stepping stone to full automation.If you visit agent marketplaces like Agent.ai or SwarmZero.ai, you'll see on the order of hundreds of agents available to handle what are largely small, mundane, and repetitive tasks. There are similar domain agent marketplaces on OpenAI's site, Anthropic's, GitHub, Hugging Face, and more. Let's go along with the current norm and define \"assistants\" as gaggles of agents loosely collaborating to accomplish more complex tasks, perhaps as part of a hybrid AI-human team or for some cases ultimately on behalf of the entire organization, and yet, still not requiring full-on AGI. (Consider what just one techno-savvy entrepreneur with a diverse collection of AI auto-orgs might do.)The missing elements are reliable agent accuracy, which yields trust, and the hardware and power to run it all. Trust, unfortunately in the near term, may play second fiddle to profit, as the AI snake oil is sold to companies and governments and ultimately end users, most of whom barely understand it.In fact, the scientists themselves barely understand it. 
The deep learning networks that power today's LLMs are generally black boxes, layers upon layers of neural networks, numeric weights and matrix computations, where it's pretty difficult to tell where any given word, image fragment, or concept is held in the vast space of the model, and how with various feed-forward and back-propagation processes in the network it is used in computing responses.A GPT model formed by combining successive attention and neural net layers. Input comes in at the left, and it's black boxes all the way down.[15]   Black box or not, as Sam Altman says, deep learning just works.[16] Sort of - AGI is unlikely without strong ontological and reasoning abilities and a tactile understanding of the physical world.[17] And deep learning itself is not without its problems. If the training data is biased, so will be the results. Trainers have to be alert to overfitting the model to the training data in a way that makes the model ineffective on new data. And implementors need better tools which help introspect and observe the model to provide verification, to illuminate the black box. Until then, any technology which cannot be understood is indistinguishable from magic.Hell-o OperatorAI is a broad term, encompassing many technologies, machine learning being just one of them, and deep learning based on neural networks being an even further niche. In many ways, given the black box nature of the solution, AI has become a substitute word for \"automation\", and/or \"program\", or \"algorithm\". And the ill-defined AI landscape is moving fast. Twelve months ago the buzz was about the emergence of the \"prompt engineer\" role in lieu of computer programmers, and today, not so much. Instead we now have thin but actionable (i.e. product-oriented) definitions like \"agent\" and \"assistant\" and a new suite of tools and cute icons to put on enterprise architecture diagrams. 
This is not to even mention the human and organizational impact of new agent-based workflows characterized by iterative, non-waterfall business processes - not something well understood or appreciated outside of software engineering circles.In this turbulent time, with vendors leapfrogging each other's capabilities and performance, there is no and cannot be any real standardization, no agreed abstractions on which to base a unifying orchestration layer. Move fast and break things, fix them later if they live long enough. Let the prototype knowingly become the short-lived product, and iterate, maybe. Think: sqrt of web time. Think: ChatGPT + IFTTT.[18] That is not an enterprise IT solution, nor one manageable for most individuals. That is a fine mess.Thankfully, we'll soon have AI assistants to fix it for us. - andy (linkedin: andygallojr)References[1] https://www.datacenterdynamics.com/en/news/aws-acquires-talens-nuclear-data-center-campus-in-pennsylvania/[2] https://www.datacenterdynamics.com/en/news/three-mile-island-nuclear-power-plant-to-return-as-microsoft-signs-20-year-835mw-ai-data-center-ppa/Readers unfamiliar with the nuclear accident at Three Mile Island in 1979 can read the summary here: https://en.wikipedia.org/wiki/Three_Mile_Island_accident[3] https://finance.yahoo.com/news/nvidia-stocks-correction-accelerated-since-020804144.html[4] https://www.smbom.com/news/14253[5] Gartner Hype Cycle for Emerging Technologies, August 2024, https://emt.gartnerweb.com/ngw/globalassets/en/newsroom/images/graphs/august_2024_ethc.png[6] \"The Computational Limits of Deep Learning\", https://arxiv.org/pdf/2007.05558[7] Sam Altman on X: \"there is no wall\", https://x.com/sama/status/1856941766915641580[8] Surfing the Singularity blog, https://surfthesing.blogspot.com/2024/12/surfing-singularity-coming-wave-book.html[9] \"The Coming Wave\", M. 
Suleyman, Crown Pub., 2023[10] https://www.theneurondaily.com/p/openais-leaked-agi-roadmap[11] 12 Days of OpenAI, Day 12: https://www.youtube.com/watch?v=SKBG1sqdyIU [12] \"AI Snake Oil\", Narayanan & Kapoor, Princeton U. Press, 2024[13] https://www.salesforce.com/news/stories/einstein-sales-agents-announcement[14] https://www.servicenow.com/standard/resource-center/data-sheet/ds-virtual-agent.html[15] https://miro.medium.com/v2/0*-8c-MXmNvcvTLdHH.png We recommend the following video for those not familiar with this architecture:  https://youtu.be/KJtZARuO3JY?si=Muq2xRdSTaa9LMXb[16] https://ia.samaltman.com/[17] Yann LeCun on Lex Fridman podcast, https://www.youtube.com/watch?v=5t1vTLU7s40[18] https://ifttt.com/chatgpt",
            "content_html": "<div class=\"separator\" style=\"clear: both; text-align: center;\"></div><p><br />&lt;div style=\"text-align: left;\"&gt;<br />&lt;/div&gt;</p><div><span style=\"font-family: verdana;\">Our imaginations, having been so stimulated by the \"innovation trigger\" of early interactions with ChatGPT and its LLM kin, having experienced the illusion of the algorithm reading your mind, we have now firmly entered into the period of inflated expectations. Any day now we expect a knock on the door to be informed by some HAL Junior that not only are we now out of a job, we've also got 20 minutes to evacuate the premise before its bulldozed to make way for another solar farm and data center. AGI is only just one product announcement away, or maybe two, but certainly three at most... </span></div><div><h3 style=\"text-align: left;\"><span style=\"font-family: verdana;\">Nose Deep </span></h3></div><div><span style=\"font-family: verdana;\">There is a strong desire on the part of companies trafficking in AI to generate not just chatbot hallucinations but also customers for real business use cases, meaning revenue, and now. To do that we're going to need hardware, fast, lots of it, and gigajoules to power it. 
So AWS buys a new data center in PA adjacent to a 2.5GW nuclear power plant.[1] Not to be outdone Microsoft re-revs up Three Mile Island (albeit with a catchy rebranding laughable by 1970's standards), with 100% of the power going to their regional AI data centers.[2] </span></div><div><span style=\"font-family: verdana;\"><br /></span><table align=\"center\" cellpadding=\"0\" cellspacing=\"0\" class=\"tr-caption-container\" style=\"margin-left: auto; margin-right: auto;\"><tbody><tr><td style=\"text-align: center;\"><a href=\"https://media.datacenterdynamics.com/media/images/Constellation_Three_Mile_Island.width-358.png\" style=\"margin-left: auto; margin-right: auto;\"><img border=\"0\" height=\"280\" src=\"https://media.datacenterdynamics.com/media/images/Constellation_Three_Mile_Island.width-358.png\" width=\"400\" /></a></td></tr><tr><td class=\"tr-caption\" style=\"text-align: center;\"><span style=\"font-family: verdana;\">Three Mile Island nuclear power plant, aka the \"Crane Clean Energy Center\".<br /></span>  </td></tr></tbody></table></div><div><span style=\"font-family: verdana;\">After vigorous expectations the trough of disillusionment will soon follow. Already Microsoft hints that demand for AI-oriented chips is waning.[3] Practical, as you'll have a hard time getting them anyway - the data-center grade GPU chips on which AI computation relies are in short supply - NVIDIA via their TSMC outsource manufacturing partner is fully booked for Blackwell GPU orders for the next 12 months.[4] AWS has recently announced to customers (like me) new limitations on availability of certain NVIDIA GPU instances. (Consider also that AI competes with crypto for these scarce GPUs.) Intel suggests it will ship mass quantities of chips for AI-ready PCs and other mobile devices in 2025, but the stock traders are not yet buying it, with the stock currently fallen over 50% year-over-year. 
In the end, and as evidenced by the long term investments, we of course expect the march of techno-progress to continue, but in the short run, aligning expectations with reality may remain a challenge.</span></div><div><span style=\"font-family: verdana;\"><br /></span></div><div><table align=\"center\" cellpadding=\"0\" cellspacing=\"0\" class=\"tr-caption-container\" style=\"margin-left: auto; margin-right: auto;\"><tbody><tr><td style=\"text-align: center;\"></td></tr><tr><td class=\"tr-caption\" style=\"text-align: center;\"><div style=\"font-family: verdana;\">The August 2024 Gartner Hype Cycle for Emerging Technologies. </div><div style=\"font-family: verdana;\">Generative AI - weee! [5]</div></td></tr></tbody></table></div><div><span style=\"font-family: verdana;\"><div style=\"text-align: center;\"><br /></div>What does OpenAI say about all this? First, the desire to be non-profit has bumped up against the realities of scaling up the models. Will they continue to scale up, yielding better and deeper performance on the road to artificial general intelligence simply by scaling up, or will they hit a theoretical wall?[6] Sam Altman says succinctly: \"there is no wall\".[7] The nuclear-powered race is on, be it sustainable or not.</span></div><div><span style=\"font-family: verdana;\"><br /></span><h3 style=\"text-align: left;\"><span style=\"font-family: verdana;\">\"Your wait time is now less than...\"</span></h3></div><div><span style=\"font-family: verdana;\">But as we argued in the last blog [8], we don't need dystopia-inducing super-human AGI in order to make productive and disruptive use of artificial intelligence technologies - a domain-tuned artificial capable intelligence (ACI) is enough.[9] Or a collaborating set of them. 
</span></div><div><span style=\"font-family: verdana;\"><br />OpenAI's strategic product roadmap is more than a little vague [10], but in theory after chatbots capable of basic reasoning comes the age of agents - think: allowing Alexa to auto-restock your pantry via a hotline to Bezos when it overhears you say you're low on sugar. Such \"AI\" does such a good job doing basic things like, oh I dunno, controlling the lights in your home now, what could go wrong?! Truth is, today's LLMs perform only so-so on standardized benchmarks, and while they improve all the time [11], the current state of the art is not yet ready to be trusted and at times seems like snake oil.[12]</span></div><div><span style=\"font-family: verdana;\"><br />Today's agents tend to be domain-specific and tailored to narrow purpose - Salesforce.com agents for common customer interactions, ServiceNow agents helping the human agent perform repetitive or summary tasks in handling case loads, but not replacing the human.[13,14] Google Gemini can add events to your calendar, help you plan travel, but is not yet trusted to actually borrow your credit card and book it. Keeping the human-in-the-loop will remain for now, as a stepping stone to full automation.</span></div><div><span style=\"font-family: verdana;\"><br />If you visit agent marketplaces like Agent.ai or SwarmZero.ai, you'll see on the order of hundreds of agents available to handle what are largely small, mundane, and repetitive tasks. There are similar domain agent marketplaces on OpenAI's site, Anthropic's, GitHub, Hugging Face, and more. Let's go along with the current norm and define \"assistants\" as gaggles of agents loosely collaborating to accomplish more complex tasks, perhaps as part of a hybrid AI-human team or for some cases ultimately on behalf of the entire organization, and yet, still not requiring full-on AGI. 
(Consider what just one techno-savvy entrepreneur with a diverse collection of AI auto-orgs might do.)</span></div><div><span style=\"font-family: verdana;\"><br />The missing elements are reliable agent accuracy, which yields trust, and the hardware and power to run it all. Trust, unfortunately in the near term, may play second fiddle to profit, as the AI snake oil is sold to companies and governments and ultimately end users, most of whom barely understand it.</span></div><div><span style=\"font-family: verdana;\"><br />In fact, the scientists themselves barely understand it. The deep learning networks that power today's LLMs are generally black boxes, layers upon layers of neural networks, numeric weights and matrix computations, where it's pretty difficult to tell where any given word, image fragment, or concept is held in the vast space of the model, and how with various feed-forward and back-propagation processes in the network it is used in computing responses.</span></div><div><span style=\"font-family: verdana;\"><br /></span></div><div><table align=\"center\" cellpadding=\"0\" cellspacing=\"0\" class=\"tr-caption-container\" style=\"margin-left: auto; margin-right: auto;\"><tbody><tr><td style=\"text-align: center;\"><a href=\"https://miro.medium.com/v2/0*-8c-MXmNvcvTLdHH.png\" style=\"margin-left: auto; margin-right: auto;\"><img border=\"0\" height=\"360\" src=\"https://miro.medium.com/v2/0*-8c-MXmNvcvTLdHH.png\" width=\"640\" /></a></td></tr><tr><td class=\"tr-caption\" style=\"text-align: center;\"><span style=\"font-family: verdana; text-align: left;\">A GPT model formed by combining successive attention and neural net layers. 
Input comes in at the left, and it's black boxes all the way down.[15]</span></td></tr></tbody></table><div class=\"separator\" style=\"clear: both; text-align: center;\"> <span>  </span></div><span style=\"font-family: verdana;\"><br /></span></div><div><span style=\"font-family: verdana;\">Black box or not, as Sam Altman says, deep learning just works.[16] Sort of - AGI is unlikely without strong ontological and reasoning abilities and a tactile understanding of the physical world.[17] And deep learning itself is not without its problems. If the training data is biased, so will be the results. Trainers have to be alert to overfitting the model to the training data in a way that makes the model ineffective on new data. And implementors need better tools which help introspect and observe the model to provide verification, to illuminate the black box. Until then, any technology which cannot be understood is indistinguishable from magic.</span></div><div><span style=\"font-family: verdana;\"><br /></span><h3 style=\"text-align: left;\"><span style=\"font-family: verdana;\">Hell-o Operator</span></h3></div><div><span style=\"font-family: verdana;\">AI is a broad term, encompassing many technologies, machine learning being just one of them, and deep learning based on neural networks being an even further niche. In many ways, given the black box nature of the solution, AI has become a substitute word for \"automation\", and/or \"program\", or \"algorithm\". And the ill-defined AI landscape is moving fast. Twelve months ago the buzz was about the emergence of the \"prompt engineer\" role in lieu of computer programmers, and today, not so much. Instead we now have thin but actionable (i.e. product-oriented) definitions like \"agent\" and \"assistant\" and a new suite of tools and cute icons to put on enterprise architecture diagrams. 
This is not to even mention the human and organizational impact of new agent-based workflows characterized by iterative, non-waterfall business processes - not something well understood or appreciated outside of software engineering circles.</span></div><div><span style=\"font-family: verdana;\"><br />In this turbulent time, with vendors leapfrogging each other's capabilities and performance, there is no and cannot be any real standardization, no agreed abstractions on which to base a unifying orchestration layer. Move fast and break things, fix them later if they live long enough. Let the prototype knowingly become the short-lived product, and iterate, maybe. Think: sqrt of web time. Think: ChatGPT + IFTTT.[18] That is not an enterprise IT solution, nor one manageable for most individuals. That is a fine mess.</span></div><div><span style=\"font-family: verdana;\"><br />Thankfully, we'll soon have AI assistants to fix it for us. </span></div><div><span style=\"font-family: verdana;\"><br /></span></div><div><span style=\"font-family: verdana;\">- andy <span style=\"font-size: 16px;\">(linkedin: andygallojr)</span><br /><br /><br /></span><h3 style=\"text-align: left;\"><span style=\"font-family: verdana;\">References</span></h3><span style=\"font-family: verdana;\">[1] <a href=\"https://www.datacenterdynamics.com/en/news/aws-acquires-talens-nuclear-data-center-campus-in-pennsylvania/\">https://www.datacenterdynamics.com/en/news/aws-acquires-talens-nuclear-data-center-campus-in-pennsylvania/</a><br />[2] <a href=\"https://www.datacenterdynamics.com/en/news/three-mile-island-nuclear-power-plant-to-return-as-microsoft-signs-20-year-835mw-ai-data-center-ppa/\">https://www.datacenterdynamics.com/en/news/three-mile-island-nuclear-power-plant-to-return-as-microsoft-signs-20-year-835mw-ai-data-center-ppa/</a></span></div><div><span style=\"font-family: verdana;\">Readers unfamiliar with the nuclear accident at Three Mile Island in 1979 can read the summary here: <a 
href=\"https://en.wikipedia.org/wiki/Three_Mile_Island_accident\">https://en.wikipedia.org/wiki/Three_Mile_Island_accident</a></span></div><div><span style=\"font-family: verdana;\">[3] <a href=\"https://finance.yahoo.com/news/nvidia-stocks-correction-accelerated-since-020804144.html\">https://finance.yahoo.com/news/nvidia-stocks-correction-accelerated-since-020804144.html</a><br />[4] <a href=\"https://www.smbom.com/news/14253\">https://www.smbom.com/news/14253</a><br />[5] Gartner Hype Cycle for Emerging Technologies, August 2024, <a href=\"https://emt.gartnerweb.com/ngw/globalassets/en/newsroom/images/graphs/august_2024_ethc.png\">https://emt.gartnerweb.com/ngw/globalassets/en/newsroom/images/graphs/august_2024_ethc.png</a><br />[6] \"The Computational Limits of Deep Learning\", <a href=\"https://arxiv.org/pdf/2007.05558\">https://arxiv.org/pdf/2007.05558</a><br />[7] Sam Altman on X: \"there is no wall\", <a href=\"https://x.com/sama/status/1856941766915641580\">https://x.com/sama/status/1856941766915641580</a></span></div><div><span style=\"font-family: verdana;\">[8] Surfing the Singularity blog, <a href=\"https://surfthesing.blogspot.com/2024/12/surfing-singularity-coming-wave-book.html\">https://surfthesing.blogspot.com/2024/12/surfing-singularity-coming-wave-book.html</a></span></div><div><span style=\"font-family: verdana;\">[9] <span style=\"background-color: black; font-size: 15px;\">\"The Coming Wave\", </span><span style=\"background-color: black; font-size: 15px;\">M. Suleyman, Crown Pub., 2023</span><br />[10] <a href=\"https://www.theneurondaily.com/p/openais-leaked-agi-roadmap\">https://www.theneurondaily.com/p/openais-leaked-agi-roadmap</a><br />[11] 12 Days of OpenAI, Day 12: <a href=\"https://www.youtube.com/watch?v=SKBG1sqdyIU\">https://www.youtube.com/watch?v=SKBG1sqdyIU</a> <br />[12] \"AI Snake Oil\", Narayanan &amp; Kapoor, Princeton U. 
Press, 2024<br />[13] <a href=\"https://www.salesforce.com/news/stories/einstein-sales-agents-announcement\">https://www.salesforce.com/news/stories/einstein-sales-agents-announcement</a><br />[14] <a href=\"https://www.servicenow.com/standard/resource-center/data-sheet/ds-virtual-agent.html\">https://www.servicenow.com/standard/resource-center/data-sheet/ds-virtual-agent.html</a><br />[15] <a href=\"https://miro.medium.com/v2/0*-8c-MXmNvcvTLdHH.png\">https://miro.medium.com/v2/0*-8c-MXmNvcvTLdHH.png</a> We recommend the following video for those not familiar with this architecture:  <a href=\"https://youtu.be/KJtZARuO3JY?si=Muq2xRdSTaa9LMXb\">https://youtu.be/KJtZARuO3JY?si=Muq2xRdSTaa9LMXb</a><br />[16] <a href=\"https://ia.samaltman.com/\">https://ia.samaltman.com/</a><br />[17] Yann LeCun on Lex Fridman podcast, <a href=\"https://www.youtube.com/watch?v=5t1vTLU7s40\">https://www.youtube.com/watch?v=5t1vTLU7s40</a><br />[18] <a href=\"https://ifttt.com/chatgpt\">https://ifttt.com/chatgpt</a><br /><br /></span></div>",
            "url": "https://hpc.social/personal-blog/2025/surfing-the-singularity-please-hold-for-the-next-available-agent/",
            
            
            
            
            
            "date_published": "2025-01-03T17:00:00-07:00",
            "date_modified": "2025-01-03T17:00:00-07:00",
            
                "author": "Surfing the Singularity"
            
        },
    
        {
            "id": "https://hpc.social/personal-blog/2024/surfing-the-singularity-the-coming-wave-a-book-report/",
            "title": "Surfing the Singularity - \"The Coming Wave\" (a book report)",
            "summary": null,
            "content_text": "Mustapha Suleyman knows a thing or two about AI.  Originally co-founder of DeepMind, a company and IP eventually acquired by Google, Mr. Suleyman is now CEO of AI at Microsoft. In this latest \"Surfing the Singularity\" blog installment, we'll review his recent book \"The Coming Wave\". Hang ten!Go Where You Wanna GoAs a game, Go is notorious for its huge array of potential moves, exponentially more complex than chess for example, where computer models beat the best chess player way back in 1997. In 2016, DeepMind's model AlphaGo beat the best Go player in world after being trained the better part of a year with reinforced machine learning on a data set of human Go games and computer-vs-computer play. The following year, DeepMind's AlphaZero exceeded that performance in just a few days of training computation without ever being shown a single Go game, just having been described the rules of the game.[1]   &lt;div style=\"text-align: center;\"&gt;Alas, born at the wrong time.&lt;/div&gt;In his Bill Gates-recommended [2] book \"The Coming Wave\", Mr. Suleyman's dystopian thesis is this: that the combination of AI, synthetic biology, and a host of other general purpose technologies such as robotics and additive manufacturing will combine into a major technological wave which will wash over the human race and alter it in unprecedented ways. Much as in past waves - the harnessing of fire, the wheel, the printing press, the combustion engine - each set off dramatic and often cataclysmic societal change the likes of which was certainly not obvious or expected by the \"engineers\" which developed the tooling. Call it the \"rule of unintended consequences\". The author supposes there have been about two dozen such waves over human history, and as expected in these times, the rate of arrival of transformational technologies is accelerating.Take the printing press. Originally in 1440 there is but one device, the lab prototype. 
Fifty years later there are 1,000 printing presses in Europe. From producing just a few thousand hand-copied manuscripts a year, the bookmakers now produced millions. Demand for books soars, cost per unit drops, adoption deepens. What was the impact of this new information proliferation in the society? As Suleyman writes: \"Gutenberg just wanted to make money printing Bibles. Yet his press catalyzed the Scientific Revolution and the Reformation, and so became the greatest threat to the Catholic Church since its establishment.\" And in spite of the efforts of certain Byzantine lords to control the press, proliferation of the technology was and is the default, driven by FOMO, at least.Straight Outta CoruscantThe mass-scale rollout of AI is already underway, hand-in-hand with surveillance devices at the edge, high speed networking, nearly bottomless storage, and high performance computing on demand to make sense of it all. All so \"The Algorithm\" can feed you tailored news (and ads) with your morning coffee. And more, much more. Large Language Models (LLMs), trained on the corpus of human written and other creative output can now generate helpful suggestions in a variety of useful contexts (such as blog writing). And as I wrote about in my last blog [3], LLMs are useful code assistants too, although here current state of the art is about a 50% success rate on senior-level software development tasks. So yes, there is room to grow, but in line with the acceleration of the rate of change, we expect that gap to be closed in short time.What then? A whole host of human-centric but generally rule-oriented tasks - think: back-office work in the finance and insurance industry - will become fair game for AI augmentation, meaning, human replacement. We see the rise of autonomous vehicles - think: bus and cab drivers, mail and package delivery, pilots. Air traffic controllers. Call centers. Medical radiology readers. 
Not one of these applications requires a super \"artificial general intelligence\" (AGI), simply a good model tailored to a specific task set, aka \"artificial capable intelligence\" (ACI). This is nearly all line-of-sight to market.What then is not here now? The author spends a good amount of time discussing the rise and impact of artificial and synthetic biology, CRISPR gene hacking and the like. Not being personally equipped to analyze such biotechnologies, I'm simply going to leave that one to the reader, suggest it's some heady stuff, but otherwise stay in the domain of the electro-mechanical. But even with this scope limitation, what is the wider wave? Consider the rise of the bots, farm bots. GPS-guided autonomous tractors (already a thing).[4] These robots don’t look like C3PO tending the moisture reapers on Tatooine - these robots look like farm equipment and are painted John Deere green. Amazon and Walmart distribution warehouses are already highly automated - combined with autonomous vehicles and AI-driven back-office work, how many employees do we think Amazon will need in 2 years? In 5 years? They currently employ 1.5 million people and reduced headcount 5% in the last 2 years while growing revenues over 20%.[5] Mind the GapAnd while your local Joe the Plumber [6] may continue to have a job visiting homes for some time to come, the use of single-task robotic automation in construction, especially new commercial construction, and property maintenance is on the rise. Concrete and paint-spraying robots. High rise window washers. Roofers, and general laborers to move material around a job site - bots - flying, floating, swimming, walking, drilling, boring bots. And more factory bots too, using traditional and additive techniques, to print their own parts. Bots that make bots. And an unemployment office at the Department of Labor run by AI.[7]What about joining up with Uncle Sam, see the world, serve your country? 
Drone warfare in Ukraine has shown the folly of the massing of expensively equipped troops, and in the Red Sea the risks associated with large and high priced floating projections of power. Hypersonic weapons, beyond the capabilities of a human in-the-loop system to thwart. The result is asymmetric bot-on-bot warfare, beyond the battlefield, beyond borders. What are we to do with legions of technically unemployed, if they are not even useful for cannon fodder? And what are we to do with the State, if it cannot provide a system which benefits the population, which can keep it protected from proliferating technological threats? With information, wealth, and power centralized in the hands of a self-selected few, is it pitchforks and torches to the barricades then?[8]While it's clear and no surprise the coming wave will benefit those with technical and financial authority, there is a chance of a boomerang effect which will result in forces in the opposite direction. The tooling, including the availability of sophisticated AI models and the means to run them, is being democratized. While increasing in capability, the cost of military-grade drones has decreased orders of magnitude in the last decade.[9] Rabble-rousing AI deepfakes proliferate. As Mr. Suleyman says, \"anyone motivated to sow instability now has an easier time of it\", not just state actors, agents, or oligarchs, but anyone with a few thousand dollars and an axe to grind. And considering the examples from recent 21st century past, if a rogue actor were to leverage the technology for nefarious purposes (think: 9/11 and the Patriot Act), there would surely be an immediate call by the population for protection, likely but perhaps not exclusively by the State, backed by pervasive security surveillance. And this time, the means to fully execute on that wish exists. The China Syndrome “The system works! 
That’s not the problem!” It is a coming wave of contradictions and competing forces, and it sounds disruptive and quite unpleasant to say the least, perhaps even a human catastrophe. And besides avoiding the topic of bioengineering, we also haven’t yet discussed what happens when we actually do get to superhuman generalized AI - we’re still talking here about relatively dumb AI with human actors in charge, in theory. The author Mr. Suleyman concludes that the containment of this new technology - this artificial intelligence backed by autonomous mobility - a containment which has rarely if ever been possible (nukes being maybe the sole exception), must be done successfully, and urgently. It's a good sentiment, albeit one which may be too optimistic, even blindly. Can the march of this autonomous AI \"progress\" with its obvious and as yet to be seen additional consequences be stopped? I would argue, and the author would likely in the final analysis have to admit, that it cannot.What to do about it? Maybe we should give serious thought to the existential question of what it actually means to be human.[10] Or, alternatively, as Timothy Leary said...[11]Until next time.  
- andyReferences & Amusements[1] \"The Coming Wave\", Mustafa Suleyman, Crown Pub., 2023[2] Bill Gates blog, https://www.gatesnotes.com/holiday-books-2024[3] \"Surfing the Singularity: Super Grover!\", https://surfthesing.blogspot.com/2024/12/surfing-singularity-super-grover.html[4] \"John Deere Robot Planter\", https://www.cnet.com/tech/john-deere-robot-planter-the-future-of-farming-looks-like-fewer-chemicals/[5] https://www.statista.com/statistics/234488/number-of-amazon-employees/ and https://www.statista.com/statistics/266282/annual-net-revenue-of-amazoncom/[6] https://www.nytimes.com/2023/08/28/us/politics/samuel-wurzelbacher-joe-the-plumber-dead.html[7] https://www.dol.gov/agencies/oasam/centers-offices/ocio/ai-inventory[8] https://www.stlouisfed.org/community-development-research/the-state-of-us-wealth-inequality[9] https://www.technologyreview.com/2023/01/30/1067348/mass-market-military-drones-have-changed-the-way-wars-are-fought/[10] https://www.organism.earth/library/document/unapologetically-human[11] https://www.youtube.com/watch?v=IPSzTBP5PAU",
            "content_html": "<p><span style=\"font-family: verdana;\">Mustapha Suleyman knows a thing or two about AI.  Originally co-founder of DeepMind, a company and IP eventually acquired by Google, Mr. Suleyman is now CEO of AI at Microsoft. In this latest \"Surfing the Singularity\" blog installment, we'll review his recent book \"The Coming Wave\". Hang ten!</span></p><p><span style=\"font-family: verdana;\"><br /></span></p><p><span style=\"font-family: verdana; font-size: x-large;\">Go Where You Wanna Go</span></p><p><span style=\"font-family: verdana;\">As a game, Go is notorious for its huge array of potential moves, exponentially more complex than chess for example, where computer models beat the best chess player way back in 1997. In 2016, DeepMind's model AlphaGo beat the best Go player in world after being trained the better part of a year with reinforced machine learning on a data set of human Go games and computer-vs-computer play. The following year, DeepMind's AlphaZero exceeded that performance in just a few days of training computation without ever being shown a single Go game, just having been described the rules of the game.[1]   </span></p><p><span style=\"font-family: verdana;\"></span></p><div class=\"separator\" style=\"clear: both; text-align: center;\"></div><p><br />&lt;div style=\"text-align: center;\"&gt;Alas, born at the wrong time.&lt;/div&gt;</p><p></p><p><span style=\"font-family: verdana;\">In his Bill Gates-recommended [2] book \"The Coming Wave\", Mr. Suleyman's dystopian thesis is this: that the combination of AI, synthetic biology, and a host of other general purpose technologies such as robotics and additive manufacturing will combine into a major technological wave which will wash over the human race and alter it in unprecedented ways. 
Much as in past waves - the harnessing of fire, the wheel, the printing press, the combustion engine - each set off dramatic and often cataclysmic societal change the likes of which was certainly not obvious or expected by the \"engineers\" which developed the tooling. Call it the \"rule of unintended consequences\". The author supposes there have been about two dozen such waves over human history, and as expected in these times, the rate of arrival of transformational technologies is accelerating.</span></p><p><span style=\"font-family: verdana;\">Take the printing press. Originally in 1440 there is but one device, the lab prototype. Fifty years later there are 1,000 printing presses in Europe. From producing just a few thousand hand-copied manuscripts a year, the bookmakers now produced millions. Demand for books soars, cost per unit drops, adoption deepens. What was the impact of this new information proliferation in the society? As Suleyman writes: \"Gutenberg just wanted to make money printing Bibles. Yet his press catalyzed the Scientific Revolution and the Reformation, and so became the greatest threat to the Catholic Church since its establishment.\" And in spite of the efforts of certain Byzantine lords to control the press, proliferation of the technology was and is the default, driven by FOMO, at least.</span></p><p><span style=\"font-family: verdana;\"><br /></span></p><p><span style=\"font-family: verdana; font-size: x-large;\">Straight Outta Coruscant</span></p><p><span style=\"font-family: verdana;\">The mass-scale rollout of AI is already underway, hand-in-hand with surveillance devices at the edge, high speed networking, nearly bottomless storage, and high performance computing on demand to make sense of it all. All so \"The Algorithm\" can feed you tailored news (and ads) with your morning coffee. And more, much more. 
Large Language Models (LLMs), trained on the corpus of human-written and other creative output, can now generate helpful suggestions in a variety of useful contexts (such as blog writing). And as I wrote about in my last blog [3], LLMs are useful code assistants too, although here the current state of the art is about a 50% success rate on senior-level software development tasks. So yes, there is room to grow, but in line with the acceleration of the rate of change, we expect that gap to be closed in short order.</span></p><p><span style=\"font-family: verdana;\">What then? A whole host of human-centric but generally rule-oriented tasks - think: back-office work in the finance and insurance industry - will become fair game for AI augmentation, meaning, human replacement. We see the rise of autonomous vehicles - think: bus and cab drivers, mail and package delivery, pilots. Air traffic controllers. Call centers. Medical radiology readers. Not one of these applications requires a super \"artificial general intelligence\" (AGI), simply a good model tailored to a specific task set, aka \"artificial capable intelligence\" (ACI). This is nearly all line-of-sight to market.</span></p><p><span style=\"font-family: verdana;\">What then is not here now? The author spends a good amount of time discussing the rise and impact of artificial and synthetic biology, CRISPR gene hacking and the like. Not being personally equipped to analyze such biotechnologies, I'm simply going to leave that one to the reader, suggest it's some heady stuff, but otherwise stay in the domain of the electro-mechanical. But even with this scope limitation, what is the wider wave?</span></p><p></p><div class=\"separator\" style=\"clear: both; text-align: center;\"></div><p><span style=\"font-family: verdana;\">Consider the rise of the bots, farm bots. 
GPS-guided autonomous tractors (already a thing).</span><span style=\"font-family: verdana;\">[4]</span><span style=\"font-family: verdana;\"> These robots don’t look like C3PO tending the moisture reapers on Tatooine - these robots look like farm equipment and are painted John Deere green. Amazon and Walmart distribution warehouses are already highly automated - combined with autonomous vehicles and AI-driven back-office work, how many employees do we think Amazon will need in 2 years? In 5 years? They currently employ 1.5 million people and have reduced headcount 5% in the last 2 years while growing revenues over 20%.[5]</span></p><p><span style=\"font-family: verdana;\"><br /></span></p><p><span style=\"font-family: verdana;\"><span style=\"font-size: x-large;\">Mind the Gap</span></span></p><p><span style=\"font-family: verdana;\">And while your local Joe the Plumber [6] may continue to have a job visiting homes for some time to come, the use of single-task robotic automation in construction, especially new commercial construction, and property maintenance is on the rise. Concrete and paint-spraying robots. High-rise window washers. Roofers, and general laborers to move material around a job site - bots - flying, floating, swimming, walking, drilling, boring bots. And more factory bots too, using traditional and additive techniques, to print their own parts. Bots that make bots. And an unemployment office at the Department of Labor run by AI.[7]</span></p><p><span style=\"font-family: verdana;\">What about joining up with Uncle Sam to see the world and serve your country? Drone warfare in Ukraine has shown the folly of the massing of expensively equipped troops, and in the Red Sea the risks associated with large and high-priced floating projections of power. Hypersonic weapons, beyond the capabilities of a human-in-the-loop system to thwart. The result is asymmetric bot-on-bot warfare, beyond the battlefield, beyond borders. 
What are we to do with legions of the technically unemployed, if they are not even useful for cannon fodder? And what are we to do with the State, if it cannot provide a system which benefits the population, which can keep it protected from proliferating technological threats? With information, wealth, and power centralized in the hands of a self-selected few, is it pitchforks and torches to the barricades then?[8]</span></p><p><span style=\"font-family: verdana;\"></span></p><div class=\"separator\" style=\"clear: both; text-align: center;\"><span style=\"font-family: verdana;\"></span></div><p></p><p><span style=\"font-family: verdana;\">While it's clear, and no surprise, that the coming wave will benefit those with technical and financial authority, there is a chance of a boomerang effect which will result in forces in the opposite direction. The tooling, including the availability of sophisticated AI models and the means to run them, is being democratized. Military-grade drones have grown in capability while their cost has decreased by orders of magnitude in the last decade.[9] Rabble-rousing AI deepfakes proliferate. As Mr. Suleyman says, \"anyone motivated to sow instability now has an easier time of it\", not just state actors, agents, or oligarchs, but anyone with a few thousand dollars and an axe to grind. And considering the examples from the recent 21st-century past, if a rogue actor were to leverage the technology for nefarious purposes (think: 9/11 and the Patriot Act), there would surely be an immediate call by the population for protection, likely but perhaps not exclusively by the State, backed by pervasive security surveillance. And this time, the means to fully execute on that wish exist. 
</span></p><p><span style=\"font-family: verdana;\"><br /></span></p><p><span style=\"font-family: verdana;\"><span style=\"font-size: x-large;\">The China Syndrome</span></span></p><p><span style=\"font-family: verdana;\"><span style=\"font-size: x-large;\"></span></span></p><div class=\"separator\" style=\"clear: both; text-align: center;\"><span style=\"font-size: x-large;\"></span></div><p><span style=\"font-family: verdana;\"><span style=\"font-family: verdana;\">“</span><span color=\"rgba(0, 0, 0, 0.87)\" face=\"Roboto, Helvetica, Arial, sans-serif\" style=\"background-color: white; font-size: 16px; letter-spacing: 0.5px;\">The system works! That’s not the problem!”</span></span><br /><span style=\"font-family: verdana;\">It is a coming wave of contradictions and competing forces, and it sounds disruptive and quite unpleasant to say the least, perhaps even a human catastrophe. And besides avoiding the topic of bioengineering, we also haven’t yet discussed what happens when we actually do get to superhuman generalized AI - we’re still talking here about relatively dumb AI with human actors in charge, in theory.</span></p><p><span style=\"font-family: verdana;\">The author, Mr. Suleyman, concludes that containment of this new technology - this artificial intelligence backed by autonomous mobility - must be achieved successfully and urgently, even though such containment has rarely if ever been possible (nukes being maybe the sole exception). It’s a good sentiment, albeit one which may be too optimistic, even blindly so. Can the march of this autonomous AI \"progress\", with its obvious and as-yet-unseen additional consequences, be stopped? I would argue, and the author would likely in the final analysis have to admit, that it cannot.</span></p><p><span style=\"font-family: verdana;\">What to do about it? 
Maybe we should give serious thought to the existential question of what it actually means to be human.[10] Or, alternatively, as Timothy Leary said...[11]</span></p><p><span style=\"font-family: verdana;\">Until next time.  </span><span style=\"font-family: verdana;\">- andy</span></p><p><span style=\"font-family: verdana;\"><br /></span></p><p><br /></p><p><span style=\"font-family: verdana; font-size: x-large;\">References &amp; Amusements</span></p><p><span style=\"font-family: verdana;\">[1] \"The Coming Wave\", </span><span style=\"font-family: verdana;\">Mustafa Suleyman, Crown Pub., 2023</span></p><p><span style=\"font-family: verdana;\">[2] Bill Gates blog, https://www.gatesnotes.com/holiday-books-2024</span></p><p><span style=\"font-family: verdana;\">[3] \"Surfing the Singularity: Super Grover!\", </span><span style=\"font-family: verdana;\">https://surfthesing.blogspot.com/2024/12/surfing-singularity-super-grover.html</span></p><p><span style=\"font-family: verdana;\">[4] \"John Deere Robot Planter\"</span><span style=\"color: #020203;\"><span style=\"font-family: verdana;\">, </span></span><span style=\"font-family: verdana;\">https://www.cnet.com/tech/john-deere-robot-planter-the-future-of-farming-looks-like-fewer-chemicals/</span></p><p><span style=\"font-family: verdana;\">[5] https://www.statista.com/statistics/234488/number-of-amazon-employees/ and https://www.statista.com/statistics/266282/annual-net-revenue-of-amazoncom/</span></p><p><span style=\"font-family: verdana;\">[6] https://www.nytimes.com/2023/08/28/us/politics/samuel-wurzelbacher-joe-the-plumber-dead.html</span></p><p><span style=\"font-family: verdana;\">[7] https://www.dol.gov/agencies/oasam/centers-offices/ocio/ai-inventory</span></p><p><span style=\"font-family: verdana;\">[8] https://www.stlouisfed.org/community-development-research/the-state-of-us-wealth-inequality</span></p><p><span style=\"font-family: verdana;\">[9] 
https://www.technologyreview.com/2023/01/30/1067348/mass-market-military-drones-have-changed-the-way-wars-are-fought/</span></p><p><span style=\"font-family: verdana;\">[10] https://www.organism.earth/library/document/unapologetically-human</span></p><p><span style=\"font-family: verdana;\">[11] https://www.youtube.com/watch?v=IPSzTBP5PAU</span></p><p><span style=\"font-family: verdana;\"><br /></span></p><p><br /></p>",
            "url": "https://hpc.social/personal-blog/2024/surfing-the-singularity-the-coming-wave-a-book-report/",
            
            
            
            
            
            "date_published": "2024-12-18T17:00:00-07:00",
            "date_modified": "2024-12-18T17:00:00-07:00",
            
                "author": "Surfing the Singularity"
            
        },
    
        {
            "id": "https://hpc.social/personal-blog/2024/surfing-the-singularity-super-grover/",
            "title": "Surfing the Singularity - Super Grover!",
            "summary": null,
            "content_text": "Hello and happy holidays to all. In this blog installment I'll report back from SuperComputing 2024, offer up a programmer-friendly view of the quantum computing space with a code tour of Grover's algorithm, and share some of my own thoughts on using the latest crop of AI programmer assistant tools. (Sadly, not this Grover.)It was a pleasant SC24 high performance computing (HPC) conference in November. Having attended in past either in-person (Atlanta this year) or virtual, this year I chose virtual again. The big loss was being unable to troll the enormous vendor hall, but otherwise, webcasts make it much easier to be in two places at one time or to skim topics of passing interest.[1] There's a new top HPC machine (that we know of) - El Capitan, and its powered by AMD, containing about a million CPU cores and about 10 million GPU cores.[2] Molecular dynamics papers presented using a GPU-accelerated exascale computer reminds that, quantum aside for a moment, the real work is still being done in the classical world.[3] NVIDIA showcased the growing fusion of AI and HPC with their \"superchip\" designs - incorporating a CPU and a GPU on the same chip.[4] And why not, the money keeps flowing, the current outgoing federal administration now locking in the CHIPS Act funding before the end of the term.[5]NVIDIA's Grace Hopper architecture [6]But beyond the incremental improvements in compute, storage, cooling, power consumption and the like, it seemed to me, through my remote goggles, that the real action was happening on the sidelines of SC24, in the quantum computing space.Quantum Chip on Shoulder There's significant skepticism of quantum computing from the HPC community at the moment. 
Quantum computers today are toys in comparison to HPC, and stakeholders in HPC and classical computing (myself included) might wonder aloud \"what's the point?\" For some applications (like the fluid dynamics apps my company uses), quantum utility is perhaps still a decade away.[7] But the groundwork is being laid today, and when we understand that there are useful problems which can be solved on quantum computers, and only on quantum computers, we might at least allow the playing to continue. And we might be surprised at how fast quantum computing is progressing, and also just churning. Even those in the industry are hedging their bets - on which qubit technologies and which companies will be the winners - and as such are changing partners on a regular basis [8,9,10], although this sometimes means needing to unproductively reinvent the wheel (how many Python quantum circuit libraries do we need?)[11] A couple of new announcements from Big Blue caught my eye. First, an early demonstrator of incorporating an IBM quantum computer into an HPC data center, unifying the resource scheduling, was shown at RPI with their AiMOS cluster and their first-in-the-nation academic installation.[12] The second, and more important, was the paper in Nature demonstrating the union of two quantum computers via classical networks, providing another avenue for scaling up hybrid quantum computing.[13] But today's QPUs are still noisy, fragile, small, expensive, and scarce. Maturity is still a ways off. QPUs are not fungible - to be successful in executing an application on a quantum computer, we must understand the error profile of that specific device! Not just that brand of quantum computer product, but this instance of that product! 
We need hands-on examples to grow the personal and team experience with quantum while we await stabilized hardware and the productivity-enabling software abstractions which can only come after industry maturity, and these early (head-butting) experiences. Making Your Quantum Bones Today's education in and around quantum computing is still focused on the experimental audience. We are still doing experiments about quantum more than experiments with quantum. The educational material which does exist, and there is an increasing amount, is focused on a community which is very comfortable with quantum physics and its associated mathematics. Most people, myself included, do not fit this description. As a computer science graduate, adjunct prof, and software engineer by trade, I want to see higher level programming abstractions, not those centered around qubits and gates, which clearly do not scale for large programs. We will be waiting a while. So in the meantime, we need to see some examples using today's technology and syntax but which are more suited for the Comp Sci student audience, to help begin to bridge that gap. If we visit the Algorithm Zoo [14], a collection of quantum algorithms which show a computational advantage over similar classical approaches, we find some things (but not many) which might look familiar to a CS undergrad, but the implementations, when they exist, are often broken. In this current early phase of the quantum era, vendors are playing fast and loose with their SDKs, and releases with significant object model refactorings and breaking changes are the norm. pattern_match.py So I offer here an example of a quantum program which itself likely has a short shelf life. You can find it here.[15, &amp; above] It shows a cooked-up example of using Grover's algorithm, which is a quantum algorithm for search in unstructured (e.g. unsorted) data. 
Classically, \"cuz Murphy's Law\", you might need to walk the entire dataset to find the item of interest (or show it's not there) - we call this an O(n) algorithm. Using Grover's algorithm, you can do the search with a strong probability of success in O(sqrt(n)) time - a quadratic speedup, and one worth pursuing for many applications. Note we are only saying we can perform the search with high probability - this is quantum computing, and everything is a probability. And while the example shown isn't necessarily the most common application of Grover, understanding Grover is worthwhile, as it appears as a sub-procedure in many other quantum algorithms. Here are a few key points to take away, even if you don't take the time to look at the documented code example: We're going to store a small set of strings - binary strings of 1's and 0's representing a small dataset - into a quantum system of qubits. Since our strings are chosen to be 16 binary digits long, we will use 16 qubits. This is a large enough number of qubits to show some non-trivial problems, but not so many that it cannot run on a simulator on your laptop. (The performance does not scale linearly with the number of qubits.) A system of n qubits contains 2**n possible states (combinations of 0's and 1's). That's a lot. We will have far fewer state strings in our dataset - just a handful for this experiment. After initializing the qubits, we will mark each state in the quantum system which corresponds to a binary string in our dataset. To do this we will use the phase of the qubit, which is an extra useful lever you do not find in classical bits (among other unique quantum advantages). Three qubits in superposition.[16] Marking one of the states by flipping its phase.[16] Using Grover's algorithm, we will amplify the probability of finding the marked states relative to other background states. I.e., the signal is separated from the noise. 
To do this we iteratively apply an oracle quantum circuit to the initial system. The states, with our target amplified, after some number of iterations of Grover's algorithm.[16] Note that we use the same 16 qubits to encode all of the 16-digit strings in the dataset. Try that with a classical computer! We then use Grover's algorithm again using a target string, applying the oracle against the now prepared quantum system, and returning the result that the target either is or is not in the dataset. Hopefully this gives some flavor of what it's like to program a quantum computer using an example most classical programmers today can understand. There are places where the code can potentially be improved - I welcome your input in the comments section below. AI AI, Oh. During these quantum coding explorations, during other coding work, and while writing documents (like this one), I've also been experimenting with a sequence of so-called \"AI assistants\". Starting with Copilot as a plugin for the popular VS Code IDE, I quickly switched to the Codeium plugin which performed better for my purposes, mostly because it took more of my code into context while making suggestions.[17] Since then, the various vendors have leap-frogged each other in features and quality of results, and so while this tool comparison may be stale, it's also the case that any new comparison would soon itself become stale. While the VS Code IDE provides hooks for plugins, some vendors, most notably Cursor [18], took the approach that to provide a true AI assistant (AIA) for a modern software developer you needed plugin hooks in many places, and control over things like the rendering of change sets. So they forked the VS Code IDE entirely. And they weren't the only ones to do this - Codeium now also has its own IDE called Windsurf - this is what I'm currently using. This can't be the end state for the discussion and more leap-frogging is expected, so watch this space. 
I've now seen cases where people are putting \"Cursor\" on their resume - not a good idea to commit to that, but it is a good idea to start using one of these modern tools in your work, and to better understand how to adapt your work practices in light of their potential. In other words, to experiment while these AIA tools improve. The head of Anthropic, a major AI player and the maker of the model driving my current IDE, is proud to say their model reaches 50% success rates on a popular software engineering metric.[19,20] Do you want to pair program with an assistant which is only right 50% of the time? I can tell you that it's got its ups and downs. On the positive, for mundane tasks, it can do a pretty good job. Questions like \"how do you do X in Python?\" which you might have previously taken to Stack Overflow can now be answered in the IDE with custom-made code samples. You can turn your question into an instruction: Me: modify this code to take the target string as a command line argument and default to the current target string if none is provided AIA: I've modified the code to... And it will describe in words and code a working solution, albeit in my case one I found to be slightly over-engineered: Me: use sys to determine if a single arg was provided AIA: I'll modify the code to use sys.argv for a simpler command line argument handling... In another interaction, we debated if an AIA-proposed solution was mathematically efficient, at one point: AIA: Would you like me to: 1) Implement a simpler but less efficient version that demonstrates the concept? 2) Or explore how to properly implement the quantum multiplexing needed for efficient database loading? Me: yes, but make it in a new file And it did. The main reason for asking for a new file rather than change in place was that the AIA is still often wrong! 
And reviewing a large amount of inline change is hard for the human eye and mind. Me: but now your proposed solution doesn't load the database or even look at it! AIA: Ah, you're absolutely right again! And it then goes on to suggest further changes, which might make you wonder why those changes weren't proposed in the first place. Keep in mind, it's only (proudly) right 50% of the time. It is, as we say, \"generative AI\" after all - it doesn't so much \"hallucinate\" - it is designed to make s***... I mean, \"stuff\" up. (Whether this GPT-based technology, such as it is, would even be capable of producing generalized AI (AGI) is an open question, indeed.) But these tools can still be useful, not just for single-line \"tab\" completions, but now as we see here, in higher level conversations with the programmer. They can help articulate requirements, write test cases, and help drive CI/CD pipelines. And they will improve in scope and accuracy, and custom AIA models tuned for specific programming domains (e.g. quantum) already exist.[21] This is truly the death and rebirth of computer programming, as we have come to experience it. In 1975, IBM's Fred Brooks published the seminal book \"The Mythical Man-Month\" which described, among other things, the software team which one would want to wrap around a senior technical engineer - a team of as many as 10 specialized professionals to handle the documentation, testing, business communications, and more common technical tasks so your senior contributor (\"the surgeon\") can focus on great ideas and great architecture.[22] However, in today's DevOps culture, where we expect our senior engineers to be \"full stack\", to do it all, to play all roles, the AI tooling brings back some sanity and reminds us that there are tasks best left delegated. 2025 It's that time in the calendar when everyone offers up their forecasts for the coming year. I'll not wade into that. 
My only prediction is this - that in 2025, the rate of change in key emerging, hard-for-humans-to-understand (nearly black-box) technologies like AI and quantum computing will continue to accelerate, in many cases beyond our ability to comprehend or predict. This is the so-called singularity, and it's evolving and emerging during a period also marked by political, military, and economic upheaval. Surf's up. Happy New Year. - andy References [0] Photo by Ben Wicks on Unsplash [1] SC24 schedule: https://sc24.conference-program.com/ [2] El Capitan hardware overview: https://hpc.llnl.gov/documentation/user-guides/using-el-capitan-systems/hardware-overview [3] \"Breaking the Million-Electron and 1 EFLOP/s Barriers: Biomolecular-Scale Ab Initio Molecular Dynamics Using MP2 Potentials\", https://dl.acm.org/doi/pdf/10.1109/SC41406.2024.00015 [4] NVIDIA SC24 superchip press release: https://www.datacenterdynamics.com/en/news/nvidia-announces-new-gb200-nvl4-superchip-at-sc24-but-says-theres-still-value-to-be-found-in-grace-hopper/ [5] Tracking CHIPS Act funding: https://www.semiconductors.org/chips-incentives-awards/ [6] NVIDIA Grace Hopper architecture: https://developer-blogs.nvidia.com/wp-content/uploads/2022/11/grace-hopper-overview.png [7] \"Exploring quantum use cases for the aerospace industry\", IBM white paper, https://www.ibm.com/thought-leadership/institute-business-value/en-us/report/quantum-aerospace [8] IonQ with NVIDIA SC24 press release: https://ionq.com/news/ionq-to-advance-hybrid-quantum-computing-with-new-chemistry-application-and [9] Microsoft and Atom quantum press release: https://azure.microsoft.com/en-us/blog/quantum/2024/11/19/microsoft-and-atom-computing-offer-a-commercial-quantum-machine-with-the-largest-number-of-entangled-logical-qubits-on-record/ [10] Alice &amp; Bob logical qubit lib press release: https://alice-bob.com/newsroom/logical-qubit-emulator-felis-quantum-cloud-alice-bob/ [11] Quantinuum stack press release: 
https://www.quantinuum.com/blog/announcing-the-launch-of-quantinuum-nexus-our-all-in-one-quantum-computing-platform [12] RPI's experiments with HPC and quantum co-scheduling: https://www.ibm.com/quantum/blog/supercomputing-24 [13] \"Combining quantum processors with real-time classical communication\", Nature, Nov 2024, https://www.nature.com/articles/s41586-024-08178-2 [14] Algorithm Zoo: https://quantumalgorithmzoo.org/ [15] Pattern match example code: https://github.com/agallojr/research-notes/blob/02253900f33d784402f0cd0b3ed4d9d360544605/quantum/src/qiskit/pattern_match.py [16] \"QC — Grover’s algorithm\", J. Hui, https://jonathan-hui.medium.com/qc-grovers-algorithm-cd81e61cf248 [17] Codeium Windsurf IDE: https://codeium.com/windsurf [18] Cursor IDE: https://www.cursor.com/ [19] Dario Amodei, CEO Anthropic, on Lex Fridman podcast: https://www.youtube.com/watch?v=ugvHCXCOmm4&amp;t=20s&amp;pp=ygULbGV4IGZyaWRtYW4%3D [20] SWE-bench: https://www.swebench.com/ [21] IBM Qiskit Code Assistant: https://www.ibm.com/quantum/blog/qiskit-code-assistant [22] \"The Mythical Man-Month\", Fred Brooks, 1975: https://web.eecs.umich.edu/~weimerw/2018-481/readings/mythical-man-month.pdf",
            "content_html": "<header class=\"pt4\"><p style=\"line-height: 1.2; text-align: left;\"><span style=\"font-weight: normal;\"><span style=\"font-family: verdana;\"><span color=\"rgba(255, 255, 255, 0.9)\" face=\"-apple-system, system-ui, &quot;system-ui&quot;, &quot;Segoe UI&quot;, Roboto, &quot;Helvetica Neue&quot;, &quot;Fira Sans&quot;, Ubuntu, Oxygen, &quot;Oxygen Sans&quot;, Cantarell, &quot;Droid Sans&quot;, &quot;Apple Color Emoji&quot;, &quot;Segoe UI Emoji&quot;, &quot;Segoe UI Emoji&quot;, &quot;Segoe UI Symbol&quot;, &quot;Lucida Grande&quot;, Helvetica, Arial, sans-serif\" style=\"background-color: #1b1f23; font-size: 16px;\">Hello and happy holidays to all. In this blog installment I'll report back from SuperComputing 2024, offer up a programmer-friendly view of the quantum computing space with a code tour of Grover's algorithm, and share some of my own thoughts on using the latest crop of AI programmer assistant tools.</span><span class=\"white-space-pre\" color=\"rgba(255, 255, 255, 0.9)\" face=\"var(--artdeco-reset-typography-font-family-sans)\"> </span></span></span></p></header><div class=\"relative reader__grid\"><div><div><div class=\"reader-article-content reader-article-content--content-blocks\" dir=\"ltr\"><div class=\"reader-content-blocks-container\" tabindex=\"0\"><div class=\"reader-image-block reader-image-block--resize\"><figure class=\"reader-image-block__figure\"><div class=\"ivm-image-view-model\"><div class=\"ivm-view-attr__img-wrapper\"><img alt=\"\" class=\"ivm-view-attr__img--centered reader-image-block__img evi-image lazy-image ember-view\" id=\"ember869\" src=\"https://media.licdn.com/dms/image/v2/D4E12AQEXGYm_FH40rg/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1733613595470?e=1740009600&amp;v=beta&amp;t=27x_IfTeyWu00AhmowXKRB402xwRZI5SLAg3MUm9UGc\" /></div></div><figcaption class=\"reader-image-block__figure-image-caption display-block full-width text-body-small-open t-sans 
text-align-center t-black--light\"><span style=\"font-family: verdana;\">(Sadly, not this Grover.)</span></figcaption></figure></div><p class=\"ember-view reader-text-block__paragraph\" id=\"ember870\"><span style=\"font-family: verdana;\">It was a pleasant SC24 high performance computing (HPC) conference in November. Having attended in the past both in person and virtually, I chose virtual again this year (the in-person event was in Atlanta). The big loss was being unable to troll the enormous vendor hall, but otherwise, webcasts make it much easier to be in two places at one time or to skim topics of passing interest.[1]<span class=\"white-space-pre\" face=\"var(--artdeco-reset-typography-font-family-sans)\"> </span></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember871\"><span style=\"font-family: verdana;\">There's a new top HPC machine (that we know of) - El Capitan, and it's powered by AMD, containing about a million CPU cores and about 10 million GPU cores.[2] Molecular dynamics papers presented using a GPU-accelerated exascale computer remind us that, quantum aside for a moment, the real work is still being done in the classical world.[3] NVIDIA showcased the growing fusion of AI and HPC with their \"superchip\" designs - incorporating a CPU and a GPU on the same chip.[4] And why not? The money keeps flowing, with the outgoing federal administration now locking in the CHIPS Act funding before the end of the term.[5]</span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember872\"><br /></p><div class=\"reader-image-block reader-image-block--full-width\"><figure class=\"reader-image-block__figure\"><div class=\"ivm-image-view-model\"><div class=\"ivm-view-attr__img-wrapper\"><img alt=\"\" class=\"ivm-view-attr__img--centered reader-image-block__img evi-image lazy-image ember-view\" id=\"ember873\" 
src=\"https://media.licdn.com/dms/image/v2/D4E12AQE0Sq5LcwFEAQ/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1733610500704?e=1740009600&amp;v=beta&amp;t=Dr2oQXKZQk3AXM8R_LhkmyXIRrnOeaejjn8q97YVy_w\" /></div></div><figcaption class=\"reader-image-block__figure-image-caption display-block full-width text-body-small-open t-sans text-align-center t-black--light\"><span style=\"font-family: verdana;\">NVIDIA's Grace Hopper architecture [6]</span></figcaption></figure></div><p class=\"ember-view reader-text-block__paragraph\" id=\"ember874\"><span style=\"font-family: verdana;\">But beyond the incremental improvements in compute, storage, cooling, power consumption and the like, it seemed to me, through my remote goggles, that the real action was happening on the sidelines of SC24, in the quantum computing space.</span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember874\"><span style=\"font-family: verdana;\"><br /></span></p><h2><span style=\"font-family: verdana;\">Quantum Chip on Shoulder<span class=\"white-space-pre\" face=\"var(--artdeco-reset-typography-font-family-sans)\"> </span></span></h2><p class=\"ember-view reader-text-block__paragraph\" id=\"ember876\"><span style=\"font-family: verdana;\">There's significant skepticism of quantum computing from the HPC community at the moment. 
Quantum computers today are toys in comparison to HPC, and stakeholders in HPC and classical computing (which would include myself) might wonder aloud \"what's the point?\" For some applications (like the fluid dynamics apps my company uses), quantum utility is perhaps still a decade away.[7]<span class=\"white-space-pre\" face=\"var(--artdeco-reset-typography-font-family-sans)\"> </span></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember877\"><span style=\"font-family: verdana;\">But the groundwork is being laid today, and when we understand that there are useful problems which can be solved on quantum computers, and only on quantum computers, we might at least allow the playing to continue. And we might be surprised at how fast quantum computing is progressing, and also just churning. Even those in the industry are hedging their bets - on which qubit technologies and which companies will be the winners - and as such are changing partners on a regular basis [8,9,10], although this sometimes means needing to unproductively reinvent the wheel (how many Python quantum circuit libraries do we need?)[11]</span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember878\"><span style=\"font-family: verdana;\">A couple of new announcements from Big Blue caught my eye. 
First, an early demonstrator of incorporating an IBM quantum computer into an HPC data center, unifying the resource scheduling, was shown at RPI with their AiMOS cluster and their first-in-the-nation academic installation.[12] The second, and more important, was the paper in Nature demonstrating the union of two quantum computers via classical networks, providing another avenue for scaling up hybrid quantum computing.[13]<span class=\"white-space-pre\" face=\"var(--artdeco-reset-typography-font-family-sans)\"> </span></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember879\"><span style=\"font-family: verdana;\">But today's QPUs are still noisy, fragile, small, expensive, and scarce. Maturity is still a ways off. QPUs are not fungible - to be successful in executing an application on a quantum computer, we must understand the error profile of that specific device! Not just that brand of quantum computer product, but this instance of that product! We need hands-on examples to grow the personal and team experience with quantum while we await stabilized hardware and the productivity-enabling software abstractions which can only come after industry maturity, and these early (head-butting) experiences.<span class=\"white-space-pre\" face=\"var(--artdeco-reset-typography-font-family-sans)\"> </span></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember879\"><span style=\"font-family: verdana;\"><span class=\"white-space-pre\" face=\"var(--artdeco-reset-typography-font-family-sans)\"><br /></span></span></p><h2><span style=\"font-family: verdana;\">Making Your Quantum Bones</span></h2><p class=\"ember-view reader-text-block__paragraph\" id=\"ember881\"><span style=\"font-family: verdana;\">Today's education in and around quantum computing is still focused on the experimental audience. We are still doing experiments about quantum more than experiments with quantum. 
The educational material which does exist, and there is an increasing amount, is focused on a community which is very comfortable with quantum physics and its associated mathematics. Most people, myself included, do not fit this description. As a computer science graduate, adjunct prof, and software engineer by trade, I want to see higher level programming abstractions, not those centered around qubits and gates, which clearly do not scale for large programs. We will be waiting a while. So in the meantime, we need to see some examples using today's technology and syntax but which are more suited for the Comp Sci student audience, to help begin to bridge that gap.<span class=\"white-space-pre\" face=\"var(--artdeco-reset-typography-font-family-sans)\"> </span></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember882\"><span style=\"font-family: verdana;\">If we visit the Algorithm Zoo [14], a collection of quantum algorithms which show a computational advantage over similar classical approaches, we find some things (but not many) which might look familiar to a CS undergrad, but the implementations, when they exist, are often broken. 
In this current early phase of the quantum era, vendors are playing fast and loose with their SDKs, and releases with significant object model refactorings and breaking changes are the norm.<span class=\"white-space-pre\" face=\"var(--artdeco-reset-typography-font-family-sans)\"> </span></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember883\"><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo\" href=\"https://github.com/agallojr/research-notes/blob/02253900f33d784402f0cd0b3ed4d9d360544605/quantum/src/qiskit/pattern_match.py\" target=\"_self\"><span style=\"font-family: verdana;\">pattern_match.py</span></a></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember884\"><span style=\"font-family: verdana;\">So I offer here an example of a quantum program which itself likely has a short shelf life. You can find it here.[15, and linked above] It shows a cooked up example of using Grover's algorithm, which is a quantum algorithm for search in unstructured (e.g. unsorted) data. Classically, \"cuz Murphy's Law\", you might need to walk the entire dataset to find the item of interest (or show it's not there) - we call this an O(n) algorithm. Using Grover's algorithm, you can do the search with a strong probability of success in O(sqrt(n)) time - a quadratic speedup, and one worth pursuing for many applications. Note we are only saying we can perform the search with high probability - this is quantum computing, and everything is a probability. 
And while the example shown isn't necessarily the most common application of Grover, understanding Grover is worthwhile, as it appears as a sub-procedure in many other quantum algorithms.</span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember885\"><span style=\"font-family: verdana;\">Here are a few key points to take away, even if you don't take the time to look at the documented code example:</span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember886\"></p><ul><li><span style=\"font-family: verdana;\">We're going to store a small set of strings - binary strings of 1's and 0's representing a small dataset - into a quantum system of qubits.<span class=\"white-space-pre\" face=\"var(--artdeco-reset-typography-font-family-sans)\"> </span></span></li><li><span style=\"font-family: verdana;\">Since our strings are chosen to be 16 binary digits long, we will use 16 qubits. This is a large enough number of qubits to show some non-trivial problems, but not so many that it can't be run on a simulator on your laptop. (Simulation cost does not scale linearly with the number of qubits: the state a classical simulator must track doubles in size with each added qubit.)</span></li><li><span style=\"font-family: verdana;\">A system of n qubits has 2**n possible basis states (combinations of 0's and 1's). For n = 16 that is 65,536. That's a lot. We will have far fewer state strings in our dataset - just a handful for this experiment.</span></li><li><span style=\"font-family: verdana;\">After initializing the qubits, we will mark each state in the quantum system which corresponds to a binary string in our dataset. 
To do this we will use the phase of the qubit, which is an extra useful lever you do not find in classical bits (among other unique quantum advantages).</span></li></ul><p style=\"color: black;\"></p><div class=\"reader-image-block reader-image-block--full-width\"><figure class=\"reader-image-block__figure\"><div class=\"ivm-image-view-model\"><div class=\"ivm-view-attr__img-wrapper\"><img alt=\"\" class=\"ivm-view-attr__img--centered reader-image-block__img evi-image lazy-image ember-view\" id=\"ember887\" src=\"https://media.licdn.com/dms/image/v2/D4E12AQH3qvl7gip82A/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1733610781858?e=1740009600&amp;v=beta&amp;t=5rAwLjF3EHyEDVvd1GHOAO65F-dKDYWy8gCvJ27vCF4\" /></div></div><figcaption class=\"reader-image-block__figure-image-caption display-block full-width text-body-small-open t-sans text-align-center t-black--light\"><span style=\"font-family: verdana;\">Three qubits in superposition.[16]</span></figcaption></figure></div><div class=\"reader-image-block reader-image-block--full-width\"><figure class=\"reader-image-block__figure\"><div class=\"ivm-image-view-model\"><div class=\"ivm-view-attr__img-wrapper\"><img alt=\"\" class=\"ivm-view-attr__img--centered reader-image-block__img evi-image lazy-image ember-view\" id=\"ember888\" src=\"https://media.licdn.com/dms/image/v2/D4E12AQGrCHC-cx2zUg/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1733610877320?e=1740009600&amp;v=beta&amp;t=lHNGjay8jH78KrwCl9WJwEIHqXuIj4LIX6gviS9nVSM\" /></div></div><figcaption class=\"reader-image-block__figure-image-caption display-block full-width text-body-small-open t-sans text-align-center t-black--light\"><span style=\"font-family: verdana;\">Marking one of the states by flipping its phase.[16]</span></figcaption></figure></div><p class=\"ember-view reader-text-block__paragraph\" id=\"ember889\"></p><ul><li><span style=\"font-family: verdana;\">Using Grover's algorithm, we will 
amplify the probability of finding the marked states relative to other background states. I.e., the signal is separated from the noise. To do this we iteratively apply an oracle quantum circuit to the initial system.</span></li></ul><p style=\"color: black;\"></p><div class=\"reader-image-block reader-image-block--full-width\"><figure class=\"reader-image-block__figure\"><div class=\"ivm-image-view-model\"><div class=\"ivm-view-attr__img-wrapper\"><img alt=\"\" class=\"ivm-view-attr__img--centered reader-image-block__img evi-image lazy-image ember-view\" id=\"ember890\" src=\"https://media.licdn.com/dms/image/v2/D4E12AQHu1UQVg1dbqQ/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1733610938287?e=1740009600&amp;v=beta&amp;t=-vmo3-9uS3FY-kCQiudQ-Ui0eEUyxl0Ao7k45IwSOMg\" /></div></div><figcaption class=\"reader-image-block__figure-image-caption display-block full-width text-body-small-open t-sans text-align-center t-black--light\"><span style=\"font-family: verdana;\">The states, with our target amplified, after some number of iterations of Grover's algorithm.[16]</span></figcaption></figure></div><p class=\"ember-view reader-text-block__paragraph\" id=\"ember891\"></p><ul><li><span style=\"font-family: verdana;\">Note that we use the same 16 qubits to encode all of the 16-digit strings in the dataset. Try that with a classical computer!</span></li><li><span style=\"font-family: verdana;\">We then use Grover's algorithm again using a target string, applying the oracle against the now prepared quantum system, and returning the result that the target either is or is not in the dataset.</span></li></ul><p style=\"color: black;\"></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember892\"><span style=\"font-family: verdana;\">Hopefully this gives some flavor of what it's like to program a quantum computer using an example most classical programmers today can understand. 
There are places where the code can potentially be improved - I welcome your input in the comments section below.</span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember892\"><span style=\"font-family: verdana;\"><br /></span></p><h2><span style=\"font-family: verdana;\">AI AI, Oh.<span class=\"white-space-pre\" face=\"var(--artdeco-reset-typography-font-family-sans)\"> </span></span></h2><p class=\"ember-view reader-text-block__paragraph\" id=\"ember894\"><span style=\"font-family: verdana;\">During these quantum coding explorations, during other coding work, and while writing documents (like this one), I've also been experimenting with a sequence of so-called \"AI assistants\". Starting with Copilot as a plugin for the popular VS Code IDE, I quickly switched to the Codeium plugin, which performed better for my purposes, mostly because it took more of my code into context while making suggestions.[17] Since then, the various vendors have leap-frogged each other in features and quality of results, and so while this tool comparison may be stale, it's also the case that any new comparison would soon itself become stale.<span class=\"white-space-pre\" face=\"var(--artdeco-reset-typography-font-family-sans)\"> </span></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember895\"><span style=\"font-family: verdana;\">While the VS Code IDE provides hooks for plugins, some vendors, most notably Cursor [18], took the approach that to provide a true AI assistant (AIA) for a modern software developer you needed plugin hooks in many places, and control over things like the rendering of change sets. So they forked the VS Code IDE entirely. And they weren't the only ones to do this - Codeium now also has its own IDE called Windsurf - this is what I'm currently using. This can't be the end state for the discussion and more leap-frogging is expected, so watch this space. 
I've now seen cases where people are putting \"Cursor\" on their resume - not a good idea to commit to that, but it is a good idea to start using one of these modern tools in your work, and to better understand how to adapt your work practices in light of their potential.</span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember896\"><span style=\"font-family: verdana;\">In other words, to experiment while these AIA tools improve. The head of Anthropic, a major AI player and the maker of the model driving my current IDE, is proud to say their model reaches 50% success rates on a popular software engineering benchmark.[19,20] Do<span class=\"white-space-pre\" face=\"var(--artdeco-reset-typography-font-family-sans)\"> </span><span face=\"var(--artdeco-reset-typography-font-family-sans)\">you</span><span class=\"white-space-pre\" face=\"var(--artdeco-reset-typography-font-family-sans)\"> </span>want to pair program with an assistant which is only right 50% of the time? I can tell you that it's got its ups and downs.</span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember897\"><span style=\"font-family: verdana;\">On the positive side, for mundane tasks, it can do a pretty good job. Questions like \"how do you do X in Python?\" which you might have previously taken to Stack Overflow can now be answered in the IDE with custom-made code samples. 
You can turn your question into an instruction:<span class=\"white-space-pre\" face=\"var(--artdeco-reset-typography-font-family-sans)\"> </span></span></p><blockquote class=\"ember-view reader-text-block__blockquote\" id=\"ember898\"><span style=\"color: #b6d7a8; font-family: courier;\"><span face=\"var(--artdeco-reset-typography-font-family-sans)\">Me</span>: modify this code to take the target string as a command line argument and default to the current target string if none is provided</span></blockquote><blockquote class=\"ember-view reader-text-block__blockquote\" id=\"ember899\"><span style=\"color: #cc0000; font-family: courier;\"><span face=\"var(--artdeco-reset-typography-font-family-sans)\">AIA</span>: I've modified the code to...</span></blockquote><p class=\"ember-view reader-text-block__paragraph\" id=\"ember900\"><span style=\"font-family: verdana;\">And it will describe in words and code a working solution, albeit in my case one I found to be slightly over-engineered:</span></p><blockquote class=\"ember-view reader-text-block__blockquote\" id=\"ember901\"><span style=\"color: #b6d7a8; font-family: courier;\"><span face=\"var(--artdeco-reset-typography-font-family-sans)\">Me</span>: use sys to determine if a single arg was provided</span></blockquote><blockquote class=\"ember-view reader-text-block__blockquote\" id=\"ember902\"><span style=\"color: #cc0000; font-family: courier;\"><span face=\"var(--artdeco-reset-typography-font-family-sans)\">AIA</span>: I'll modify the code to use sys.argv for a simpler command line argument handling...</span></blockquote><p class=\"ember-view reader-text-block__paragraph\" id=\"ember903\"><span style=\"font-family: verdana;\">In another interaction, we debated if an AIA-proposed solution was mathematically efficient, at one point:</span></p><blockquote class=\"ember-view reader-text-block__blockquote\" id=\"ember904\"><span style=\"color: #cc0000; font-family: courier;\"><span 
face=\"var(--artdeco-reset-typography-font-family-sans)\">AIA</span>: Would you like me to: 1) Implement a simpler but less efficient version that demonstrates the concept? 2) Or explore how to properly implement the quantum multiplexing needed for efficient database loading?</span></blockquote><blockquote class=\"ember-view reader-text-block__blockquote\" id=\"ember905\"><span style=\"color: #b6d7a8; font-family: courier;\"><span face=\"var(--artdeco-reset-typography-font-family-sans)\">Me</span>: yes, but make it in a new file</span></blockquote><p class=\"ember-view reader-text-block__paragraph\" id=\"ember906\"><span style=\"font-family: verdana;\">And it did. The main reason for asking for a new file rather than a change in place was that the AIA is still often wrong! And reviewing a large amount of inline change is hard for the human eye and mind.</span></p><blockquote class=\"ember-view reader-text-block__blockquote\" id=\"ember907\"><span style=\"color: #b6d7a8; font-family: courier;\"><span face=\"var(--artdeco-reset-typography-font-family-sans)\">Me</span>: but now your proposed solution doesn't load the database or even look at it!</span></blockquote><blockquote class=\"ember-view reader-text-block__blockquote\" id=\"ember908\"><span style=\"color: #cc0000; font-family: courier;\"><span face=\"var(--artdeco-reset-typography-font-family-sans)\">AIA</span>: Ah, you're absolutely right again!<span class=\"white-space-pre\" face=\"var(--artdeco-reset-typography-font-family-sans)\"> </span></span></blockquote><p class=\"ember-view reader-text-block__paragraph\" id=\"ember909\"><span style=\"font-family: verdana;\">And it then goes on to suggest further changes, and you might wonder why those changes weren't proposed in the first place. Keep in mind, it's only (proudly) right 50% of the time. It is, as we say, \"generative AI\" after all - it doesn't so much \"hallucinate\" - it is designed to make s***... I mean, \"stuff\" up. 
(Whether this GPT-based technology, such as it is, would even be capable of producing artificial general intelligence (AGI) is an open question indeed.)</span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember910\"><span style=\"font-family: verdana;\">But these tools can still be useful, not just for single-line \"tab\" completions, but now as we see here, in higher level conversations with the programmer. They can help articulate requirements, write test cases, and help drive CI/CD pipelines. And they will improve in scope and accuracy, and custom AIA models tuned for specific programming domains (e.g. quantum) already exist.[21]<span class=\"white-space-pre\" face=\"var(--artdeco-reset-typography-font-family-sans)\"> </span></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember911\"><span style=\"font-family: verdana;\">This is truly the death and rebirth of computer programming, as we have come to experience it. In 1975, IBM's Fred Brooks published the seminal book \"The Mythical Man-Month\" which described, among other things, the software team which one would want to wrap around a senior technical engineer - a team of as many as 10 specialized professionals to handle the documentation, testing, business communications, and more common technical tasks so your senior contributor (\"the surgeon\") can focus on great ideas and great architecture.[22] However, in today's DevOps culture, where we expect our senior engineers to be \"full stack\", to do it all, to play all roles, the AI tooling brings back some sanity and reminds us that there are tasks best left delegated.</span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember911\"><span style=\"font-family: verdana;\"><br /></span></p><h2><span style=\"font-family: verdana;\">2025</span></h2><p class=\"ember-view reader-text-block__paragraph\" id=\"ember913\"><span style=\"font-family: verdana;\">It's that time in the calendar when everyone offers up their forecasts for 
the coming year. I'll not wade into that. My only prediction is this - that in 2025, the rate of change in key emerging and difficult to humanly understand (nearly black-box) technologies like AI and quantum computing will continue to accelerate, in many cases, beyond our ability to comprehend or predict. This is the so-called singularity, and it's evolving and emerging during a period also marked by political, military, and economic upheaval.<span class=\"white-space-pre\" face=\"var(--artdeco-reset-typography-font-family-sans)\"> </span></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember914\"><span style=\"font-family: verdana;\">Surf's up. Happy New Year. - andy</span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember915\"><br /></p><h2><span style=\"font-family: verdana;\">References</span></h2><p class=\"ember-view reader-text-block__paragraph\" id=\"ember917\"><span style=\"font-family: verdana;\">[0] Photo by Ben Wicks on Unsplash</span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember918\"><span style=\"font-family: verdana;\">[1] SC24 schedule:<span class=\"white-space-pre\" face=\"var(--artdeco-reset-typography-font-family-sans)\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo\" href=\"https://sc24.conference-program.com/\" target=\"_self\">https://sc24.conference-program.com/</a></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember919\"><span style=\"font-family: verdana;\">[2] El Capitan hardware overview:<span class=\"white-space-pre\" face=\"var(--artdeco-reset-typography-font-family-sans)\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo\" href=\"https://hpc.llnl.gov/documentation/user-guides/using-el-capitan-systems/hardware-overview\" target=\"_self\">https://hpc.llnl.gov/documentation/user-guides/using-el-capitan-systems/hardware-overview</a></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember920\"><span style=\"font-family: verdana;\">[3] 
\"Breaking the Million-Electron and 1 EFLOP/s Barriers: Biomolecular-Scale Ab Initio Molecular Dynamics Using MP2 Potentials\",<span class=\"white-space-pre\" face=\"var(--artdeco-reset-typography-font-family-sans)\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo\" href=\"https://dl.acm.org/doi/pdf/10.1109/SC41406.2024.00015\" target=\"_self\">https://dl.acm.org/doi/pdf/10.1109/SC41406.2024.00015</a></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember921\"><span style=\"font-family: verdana;\">[4] NVIDIA SC24 superchip press release:<span class=\"white-space-pre\" face=\"var(--artdeco-reset-typography-font-family-sans)\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo\" href=\"https://www.datacenterdynamics.com/en/news/nvidia-announces-new-gb200-nvl4-superchip-at-sc24-but-says-theres-still-value-to-be-found-in-grace-hopper/\" target=\"_self\">https://www.datacenterdynamics.com/en/news/nvidia-announces-new-gb200-nvl4-superchip-at-sc24-but-says-theres-still-value-to-be-found-in-grace-hopper/</a></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember922\"><span style=\"font-family: verdana;\">[5] Tracking CHIPS Act funding:<span class=\"white-space-pre\" face=\"var(--artdeco-reset-typography-font-family-sans)\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo\" href=\"https://www.semiconductors.org/chips-incentives-awards/\" target=\"_self\">https://www.semiconductors.org/chips-incentives-awards/</a></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember923\"><span style=\"font-family: verdana;\">[6] NVIDIA Grace Hopper architecture:<span class=\"white-space-pre\" face=\"var(--artdeco-reset-typography-font-family-sans)\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo\" href=\"https://developer-blogs.nvidia.com/wp-content/uploads/2022/11/grace-hopper-overview.png\" target=\"_self\">https://developer-blogs.nvidia.com/wp-content/uploads/2022/11/grace-hopper-overview.png</a></span></p><p 
class=\"ember-view reader-text-block__paragraph\" id=\"ember924\"><span style=\"font-family: verdana;\">[7] \"Exploring quantum use cases for the aerospace industry\", IBM white paper,<span class=\"white-space-pre\" face=\"var(--artdeco-reset-typography-font-family-sans)\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo\" href=\"https://www.ibm.com/thought-leadership/institute-business-value/en-us/report/quantum-aerospace\" target=\"_self\">https://www.ibm.com/thought-leadership/institute-business-value/en-us/report/quantum-aerospace</a></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember925\"><span style=\"font-family: verdana;\">[8] IonQ with NVIDIA SC24 press release:<span class=\"white-space-pre\" face=\"var(--artdeco-reset-typography-font-family-sans)\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo\" href=\"https://ionq.com/news/ionq-to-advance-hybrid-quantum-computing-with-new-chemistry-application-and\" target=\"_self\">https://ionq.com/news/ionq-to-advance-hybrid-quantum-computing-with-new-chemistry-application-and</a></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember926\"><span style=\"font-family: verdana;\">[9] Microsoft and Atom quantum press release:<span class=\"white-space-pre\" face=\"var(--artdeco-reset-typography-font-family-sans)\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo\" href=\"https://azure.microsoft.com/en-us/blog/quantum/2024/11/19/microsoft-and-atom-computing-offer-a-commercial-quantum-machine-with-the-largest-number-of-entangled-logical-qubits-on-record/\" target=\"_self\">https://azure.microsoft.com/en-us/blog/quantum/2024/11/19/microsoft-and-atom-computing-offer-a-commercial-quantum-machine-with-the-largest-number-of-entangled-logical-qubits-on-record/</a></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember927\"><span style=\"font-family: verdana;\">[10] Alice &amp; Bob logical qubit lib press release:<span class=\"white-space-pre\" 
face=\"var(--artdeco-reset-typography-font-family-sans)\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo\" href=\"https://alice-bob.com/newsroom/logical-qubit-emulator-felis-quantum-cloud-alice-bob/\" target=\"_self\">https://alice-bob.com/newsroom/logical-qubit-emulator-felis-quantum-cloud-alice-bob/</a></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember928\"><span style=\"font-family: verdana;\">[11] Quantinuum stack press release:<span class=\"white-space-pre\" face=\"var(--artdeco-reset-typography-font-family-sans)\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo\" href=\"https://www.quantinuum.com/blog/announcing-the-launch-of-quantinuum-nexus-our-all-in-one-quantum-computing-platform\" target=\"_self\">https://www.quantinuum.com/blog/announcing-the-launch-of-quantinuum-nexus-our-all-in-one-quantum-computing-platform</a></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember929\"><span style=\"font-family: verdana;\">[12] RPI's experiments with HPC and quantum co-scheduling:<span class=\"white-space-pre\" face=\"var(--artdeco-reset-typography-font-family-sans)\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo\" href=\"https://www.ibm.com/quantum/blog/supercomputing-24\" target=\"_self\">https://www.ibm.com/quantum/blog/supercomputing-24</a></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember930\"><span style=\"font-family: verdana;\">[13] \"Combining quantum processors with real-time classical communication\", Nature, Nov 2024,<span class=\"white-space-pre\" face=\"var(--artdeco-reset-typography-font-family-sans)\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo\" href=\"https://www.nature.com/articles/s41586-024-08178-2\" target=\"_self\">https://www.nature.com/articles/s41586-024-08178-2</a></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember931\"><span style=\"font-family: verdana;\">[14] Algorithm Zoo:<span class=\"white-space-pre\" 
face=\"var(--artdeco-reset-typography-font-family-sans)\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo\" href=\"https://quantumalgorithmzoo.org/\" target=\"_self\">https://quantumalgorithmzoo.org/</a></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember932\"><span style=\"font-family: verdana;\">[15] Pattern match example code:<span class=\"white-space-pre\" face=\"var(--artdeco-reset-typography-font-family-sans)\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo\" href=\"https://github.com/agallojr/research-notes/blob/02253900f33d784402f0cd0b3ed4d9d360544605/quantum/src/qiskit/pattern_match.py\" target=\"_self\">https://github.com/agallojr/research-notes/blob/02253900f33d784402f0cd0b3ed4d9d360544605/quantum/src/qiskit/pattern_match.py</a></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember933\"><span style=\"font-family: verdana;\">[16] \"QC — Grover’s algorithm\", J. Hui,<span class=\"white-space-pre\" face=\"var(--artdeco-reset-typography-font-family-sans)\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo\" href=\"https://jonathan-hui.medium.com/qc-grovers-algorithm-cd81e61cf248\" target=\"_self\">https://jonathan-hui.medium.com/qc-grovers-algorithm-cd81e61cf248</a></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember934\"><span style=\"font-family: verdana;\">[17] Codeium Windsurf IDE:<span class=\"white-space-pre\" face=\"var(--artdeco-reset-typography-font-family-sans)\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo\" href=\"https://codeium.com/windsurf\" target=\"_self\">https://codeium.com/windsurf</a></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember935\"><span style=\"font-family: verdana;\">[18] Cursor IDE:<span class=\"white-space-pre\" face=\"var(--artdeco-reset-typography-font-family-sans)\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo\" href=\"https://www.cursor.com/\" target=\"_self\">https://www.cursor.com/</a></span></p><p 
class=\"ember-view reader-text-block__paragraph\" id=\"ember936\"><span style=\"font-family: verdana;\">[19] Dario Amodei, CEO Anthropic, on Lex Fridman podcast:<span class=\"white-space-pre\" face=\"var(--artdeco-reset-typography-font-family-sans)\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo\" href=\"https://www.youtube.com/watch?v=ugvHCXCOmm4&amp;t=20s&amp;pp=ygULbGV4IGZyaWRtYW4%3D\" target=\"_self\">https://www.youtube.com/watch?v=ugvHCXCOmm4&amp;t=20s&amp;pp=ygULbGV4IGZyaWRtYW4%3D</a></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember937\"><span style=\"font-family: verdana;\">[20] SWE-bench:<span class=\"white-space-pre\" face=\"var(--artdeco-reset-typography-font-family-sans)\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo\" href=\"https://www.swebench.com/\" target=\"_self\">https://www.swebench.com/</a></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember938\"><span style=\"font-family: verdana;\">[21] IBM Qiskit Code Assistant:<span class=\"white-space-pre\" face=\"var(--artdeco-reset-typography-font-family-sans)\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo\" href=\"https://www.ibm.com/quantum/blog/qiskit-code-assistant\" target=\"_self\">https://www.ibm.com/quantum/blog/qiskit-code-assistant</a></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember939\"><span style=\"font-family: verdana;\">[22] \"The Mythical Man-Month\", Fred Brooks, 1975:<span class=\"white-space-pre\" face=\"var(--artdeco-reset-typography-font-family-sans)\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo\" href=\"https://web.eecs.umich.edu/~weimerw/2018-481/readings/mythical-man-month.pdf\" target=\"_self\">https://web.eecs.umich.edu/~weimerw/2018-481/readings/mythical-man-month.pdf</a></span></p><br class=\"Apple-interchange-newline\" /></div></div></div></div></div>",
            "url": "https://hpc.social/personal-blog/2024/surfing-the-singularity-super-grover/",
            
            
            
            
            
            "date_published": "2024-12-07T17:00:00-07:00",
            "date_modified": "2024-12-07T17:00:00-07:00",
            
                "author": "Surfing the Singularity"
            
        },
    
        {
            "id": "https://hpc.social/personal-blog/2024/sc-24-recap/",
            "title": "SC'24 recap",
            "summary": null,
            "content_text": "The premiere annual conference of the high-performance computing community, SC24, was held in Atlanta last week, and    it attracted a record-shattering number of attendees--nearly 18,000 registrants, up 28% from last    year! The conference felt big as well, and there seemed to be a lot more running between sessions, meetings,    and the exhibition floor. Despite its objectively bigger size though, the content of the conference felt more diffuse this year, and I was left wondering if this reflected my own biases or was a real effect of the AI industry    beginning to overflow into AI-adjacent technology conferences like SC.Of course, this isn't to say that SC24 was anything short of a great conference. Some exciting new technologies were    announced, a new supercomputer beat out Frontier to become the fastest supercomputer on the Top500 list, and I got    to catch up with a bunch of great people that I only get to see at shows like this. I'll touch on all of these    things below. But this year felt different from previous SC conferences to me, and I'll try to talk about that too.There's no great way to arrange all the things I jotted down in my notes, but I've tried to arrange them by what readers may be interested in. 
Here's the table of contents:My approach to SC this yearNew technology and announcementsTop500 and a new #1 system#1 - El Capitan#5 - Eni HPC6#16 and #17 - SoftBank CHIE-2 and CHIE-3#18 - Jülich's JUPITER Exascale Transition Instrument (JETI)#32 - Reindeer!Technology on the exhibit floorGB200Slingshot 400Grace-Grace for storage?Microsoft and AMD's new HBM CPUThe HPC industry overallWhat I learned about the average SC technical program attendeePeople think sustainability and energy efficiency are the same thingAI sessions are really scientific computing sessions about AIAI for operations is not yet real in scientific computingSome are beginning to realize that HPC exists outside of scientific computingNSF's broad front vs. DOE's big bets in HPC and AIExhibitor trendsBooths by the numbersProliferation of GPU-as-a-Service providersCommunity and connectionsGetting to know peopleTalking to early career peopleShift in social mediaSo what's the takeaway?Before getting into the details though, I should explain how my perspective shaped what I noticed (and missed) through the conference. And to be clear: these are my own personal opinions and do not necessarily reflect those of my employer. 
Although Microsoft covered the cost for me to attend SC, I wrote this blog post during my own free time over the Thanksgiving holiday, and nobody had any editorial control over what follows except me.My approach to SC this yearAlthough this is the eleventh SC conference I've attended, it was the first time that I: attended as a practitioner of hyperscale AI rather than traditional HPC and scientific computing; attended as a Microsoft engineer (I represented Microsoft as a product manager at SC22 and SC23); and did not attend SC as a designated storage person (since 2013). Because of these changes in my identity as an attendee, I approached the conference with a different set of goals in mind: As a hyperscale/AI person, I felt that I should prioritize attending all the cloud and AI sessions whenever forced to choose between one session and another. I chose to focus on understanding the traditional HPC community's understanding of hyperscale and AI, which meant I had to spend less time in the workshops, panels and BOFs where I built my career. As an engineer rather than a product manager, it wasn't my primary responsibility to run private briefings and gather HPC customers' requirements and feedback. Instead, I prioritized only those meetings where my first-hand knowledge of how massive-scale AI training works could have a meaningful impact. This meant I focused on partners and practitioners who also operate in the realm of hyperscale--think massive, AI-adjacent companies and the HPC centers who have historically dominated the very top of the Top500 list. One thing I didn't anticipate going into SC24 is that I've inherited a third identity: there is a new cohort of people in HPC who see me as a long-time community member. 
This resulted in a surprising amount of my time being spent talking to students and early career practitioners who were looking for advice. These three identities and goals meant I don't have many notes to share on the technical program, but I did capture more observations about broader trends in the HPC industry and community.New technology and announcementsHPC is all about cutting-edge technology, so that's a fine place to start talking about what was new.Top500 and a new #1 systemA cornerstone of every SC conference is the release of the new Top500 list on Monday, and this is especially true in years when a new #1 supercomputer is announced. As was widely anticipated in the weeks leading up to SC24, El Capitan unseated Frontier as the new #1 supercomputer this year, posting an impressive 1.74 EFLOPS of FP64. In addition though, Frontier grew a little (it added 400 nodes), there was a notable new #5 system (Eni's HPC6), and a number of smaller systems appeared that are worth calling out.#1 - El CapitanThe highlight of the Top500 list was undoubtedly the debut of El Capitan, Lawrence Livermore National Laboratory's massive new MI300A-based exascale supercomputer. Its 1.74 EF score resulted from a 105-minute HPL run that came in under 30 MW, and a bunch of technical details about the system were disclosed by Livermore Computing's CTO, Bronis de Supinski, during an invited talk at the Top500 BOF. Plenty of others summarize the system's speeds and feeds (e.g., see The Next Platform's article on El Cap), so I won't do that. However, I will comment on how unusual Bronis' talk was. Foremost, the El Capitan talk seemed haphazard and last-minute. Considering the system took over half a decade of planning and cost at least half a billion dollars, El Capitan's unveiling was the most unenthusiastic description of a brand-new #1 supercomputer I've ever seen. 
I can understand that the Livermore folks have debuted plenty of novel #1 systems in their careers, but El    Capitan is objectively a fascinating system, and running a full-system job for nearly two hours across first-of-a-kind APUs    is an amazing feat. If community leaders don't get excited about their own groundbreaking achievements, what kind of message should the next generation of HPC professionals take home?In sharp contrast to the blasé announcement of this new system was the leading slide that was presented to describe the speeds and feeds of El Capitan:I've never seen a speaker take the main stage and put a photo of himself literally in the center of the slide, in front of the supercomputer they're talking about. I don't know what the communications people at Livermore were trying to do with this graphic, but I don't think it    was intended to be evocative of the first thing that came to my mind:The supercomputer is literally named \"The Captain,\" and there's a photo of one dude (the boss of Livermore Computing,    who is also standing on stage giving the talk) blocking the view of the machine. It wasn't a great look, and it left me feeling very uneasy about what I was witnessing and what message it was sending to the HPC community.In case it needs to be said, HPC is a team sport. The unveiling of El Capitan (or any other #1 system    before it) is always the product of dozens, if not hundreds, of people devoting years of their professional lives to    ensuring it all comes together. It was a big miss, both to those who put in the work, and those who will have    to put in the work on future systems, to suggest that a single, smiling face comes before the success of the system deployment.#5 - Eni HPC6The other notable entrant to the Top 10 list was HPC6, an industry system deployed by Eni (a major Italian energy    company) built on MI250X. 
Oil and gas companies tend to be conservative in the systems they buy since the seismic imaging done on their large supercomputers informs hundred-million to billion-dollar investments in drilling a new well, and they have much less tolerance for weird architectures than federally funded leadership computing does. Thus, Eni's adoption of AMD GPUs in this #5 system is a strong endorsement of their capability in mission-critical commercial computing.#16 and #17 - SoftBank CHIE-2 and CHIE-3SoftBank, the Japanese investment conglomerate that, among other things, owns a significant stake in Arm, made its Top500 debut with two identical 256-node DGX H100 SuperPODs. While not technologically interesting (H100 is getting old), these systems represent significant investment in HPC by private industry in Japan and signal that SoftBank is following the lead of large American investment groups in building private AI clusters for the AI startups in their portfolios. In doing this, SoftBank's investments aren't dependent on third-party cloud providers to supply the GPUs that make these startups successful, which reduces their overall risk. Although I didn't hear anything about these SoftBank systems at the conference, NVIDIA issued a press statement during the NVIDIA AI Summit Japan during the week prior to SC24 that discussed SoftBank's investment in large NVIDIA supercomputers. The press statement states that these systems will be used \"for [SoftBank's] own generative AI development and AI-related business, as well as that of universities, research institutions and businesses throughout Japan.\" The release also suggests we can expect B200 and GB200 SuperPODs from SoftBank to appear as those technologies come online.#18 - Jülich's JUPITER Exascale Transition Instrument (JETI)Just below the SoftBank systems was the precursor system to Europe's first exascale system. 
I was hoping that JUPITER, the full exascale system being deployed at FZJ, would appear in the Top 10, but it seems like we'll have to wait for ISC25 for that. Still, the JETI system ran HPL across 480 nodes of BullSequana XH3000, the same node that will be used in JUPITER, and achieved 83 PFLOPS. By comparison, the full JUPITER system will be over 10x larger (\"roughly 6000 compute nodes\" in the Booster), and projecting the JETI run (173 TF/node) out to this full JUPITER scale gives roughly 1.04 EFLOPS, which indicates that JUPITER should just squeak over the 1.0 EFLOPS line. In preparation for JUPITER, Eviden had a couple of these BullSequana XH3000 nodes out on display this year: And if you're interested in more, I've been tracking the technical details of JUPITER in my digital garden.#32 - Reindeer!Waay down the list was Microsoft's sole new Top500 entry this cycle, an NVIDIA H200 system that ran HPL over 120 ND H200 v5 nodes in Azure. It was one of only two conventional (non-Grace) H200 clusters that appeared in the top 100, and it had a pretty good efficiency (Rmax/Rpeak > 80%). Microsoft also had a Reindeer node on display at its booth: An astute observer may note that this node looks an awful lot like the H100 node used in its Eagle supercomputer, which was on display at SC23 last year. That's because it's the same chassis, just with an upgraded HGX baseboard. Reindeer was not super exciting, and there were no press releases about it, but I mention it here for a couple of reasons: One of my teammates did the HPL run and submission, and his group got to come up with the name of the system for the purposes of HPL. As it turns out, generating a public name for a Top500 submission involves a comical amount of legal and marketing process when it comes from a giant corporation like Microsoft. 
And as it turns out,        naming a cluster \"Reindeer\" has a low probability of offending anyone.Reindeer is pretty boring--it's a relatively small cluster with a bunch of GPUs. But when you're building out AI        infrastructure at a pace of 5x Eagles (70,000            GPUs!) per month, you want the clusters that those GPUs go into to be as boring, predictable, and        automatable as possible. Seeing as how Reindeer only used 960 GPUs but still got #32, it doesn't require much        math to realize that the big hyperscalers could flood the Top500 list with these cookie-cutter GPU clusters and        (in this case) make any ranking below #32 completely irrelevant. Heaven help the Top500 list if they ever        publish an API for submitting new systems; cloud providers' build validation automation could tack a Top500        submission on at the end of burn-in and permanently ruin the list.On a personal note, the supercomputer grant that gave me my first job in the HPC business debuted at #48. It's mind-boggling that I now work in a place    where standing up a #32 system is just day-to-day business.Technology on the exhibit floorThe exhibit floor had a few new pieces of HPC technology on display this year that are    worthy of mention, but a lot of the most HPC-centric exciting stuff actually had a soft debut at ISC24 in May. For example, even though SC24 was MI300A's big splash due to    the El Capitan announcement, some MI300A nodes (such as the Cray EX255a) were on display in Hamburg. However,    Eviden had their MI300A node (branded XH3406-3) on display at SC24 which was new to me:I'm unaware of anyone who's actually committed to a large Eviden MI300A system, so I was    surprised to see that Eviden already has a full blade design. 
But as with Eni's HPC6 supercomputer, perhaps this is    a sign that AMD's GPUs (and now APUs) have graduated from being built-to-order science experiments to a technology    ecosystem that people will want to buy off the rack.There was also a ton of GH200 on the exhibit hall floor, but again, these node types were    also on display at ISC24. This wasn't a surprise since a bunch of upcoming European systems have invested in GH200    already; in addition to JUPITER's 6,000 GH200 nodes described above, CSCS Alps has 2,688 GH200 nodes, and Bristol's Isambard-AI will have 1,362 GH200    nodes. All of these systems will have a 1:1 CPU:GPU ratio and an NVL4 domain, suggesting this is the optimal way to    configure GH200 for HPC workloads. I didn't hear a single mention of GH200 NVL32.GB200SC24 was the debut of NVIDIA's Blackwell GPU in the flesh, and a bunch of integrators had    material on GB200 out at their booths. Interestingly, they all followed the same pattern as GH200 with an NVL4    domain size, and just about every smaller HPC integrator followed a similar pattern wheretheir booth had a standard \"NVIDIA Partner\" (or \"Preferred Partner!\") placard on their main deskthey had a bare NVIDIA GB200 baseboard (superchip) on displaythere wasn't much other differentiationFrom this, I gather that not many companies have manufactured GB200 nodes yet, or if they    have, there aren't enough GB200 boards available to waste them on display models. So, we had to settle for these    bare NVIDIA-manufactured, 4-GPU + 2-CPU superchip boards:What struck me is that these are very large FRUs--if a single component (CPU, GPU, voltage    regulator, DRAM chip, or anything else) goes bad, you have to yank and replace four GPUs and two CPUs. 
And because    all the components are soldered down, someone's going to have to do a lot of work to remanufacture these boards to    avoid throwing out a lot of very expensive, fully functional Blackwell GPUs.There were a few companies who were further along their GB200 journey and had more    integrated nodes on display. The HPE Cray booth had this GB200 NVL4 blade (the Cray EX154n) on display:It looks remarkably sparse compared to the super-dense blades that normally slot into the    Cray EX line, but even with a single NVL4 node per blade, the Cray EX cabinet only supports 56 of these blades,    leaving 8 blade slots empty in the optimal configuration. I assume this is a limitation of power and cooling.The booth collateral around this blade suggested its use case is \"machine learning and    sovereign AI\" rather than traditional HPC, and that makes sense since each node has 768 GB of HBM3e which is enough    to support training some pretty large sovereign models. However, the choice to force all I/O traffic on to the    high-speed network by only leaving room for one piddly node-local NVMe drive (this blade only supports one SSD per    blade) will make training on this platform very sensitive to the quality of the global storage subsystem. This is    great if you bundle this blade with all-flash Lustre (like Cray ClusterStor) or DAOS (handy, since Intel divested the entire DAOS        development team to HPE). But it's not how I would build an AI-optimized system.I suspect the cost-per-FLOP of this Cray GB200 solution is much lower than what a pure-play    GB200 for LLM training would be. 
And since GB200 is actually a solid platform for FP64 (thanks to Dan Ernst for challenging me on this and sharing    some great resources on the topic), I expect to see this node do well    in situations that are not training frontier LLMs, but rather fine-tuning LLMs, training smaller models, and mixing    in traditional scientific computing on the same general-purpose HPC/AI system.Speaking of pure-play LLM training platforms, though, I was glad that very few exhibitors    were trying to talk up GB200 NVL72 this year. It may have been the case that vendors simply aren't ready to begin    selling NVL72 yet, but I like to be optimistic and instead believe that the exhibitors who show up to SC24 know that    the scientific computing community likely won't get enough value out of a 72-GPU coherence domain to justify the    additional cost and complexity of NVL72. I didn't see a single vendor with a GB200 NVL36 or NVL72 rack on display    (or a GH200 NVL32, for that matter), and not having to think about NVL72 for the week of SC24 was a nice break from    my day job.Perhaps the closest SC24 got to NVL72 was a joint announcement at the beginning of the week    by Dell and CoreWeave, who announced that they have begun bringing        GB200 NVL72 racks online. Dell did have a massive, AI-focused booth on the exhibit floor, and they did talk    up their high-powered, liquid-cooled rack infrastructure. But in addition to supporting GB200 with NVLink Switches,    I'm sure that rack infrastructure would be equally good at supporting nodes geared more squarely at traditional HPC.Slingshot 400HPE Cray also debuted a new 400G Slingshot switch, appropriately named Slingshot 400. 
I    didn't get a chance to ask anyone any questions about it, but from the marketing material that came out right before    the conference, it sounds like a serdes upgrade without any significant changes to Slingshot's L2 protocol.There was a Slingshot 400 switch for the Cray EX rack on display at their booth, and it    looked pretty amazing:It looks way more dense than the original 200G Rosetta switch, and it introduces    liquid-cooled optics. If you look closely, you can also see a ton of flyover cables connecting the switch ASIC in    the center to the transceivers near the top; similar flyover cables are showing up in all manner of    ultra-high-performance networking equipment, likely reflecting the inability to maintain signal integrity across PCB    traces.The port density on Slingshot 400 remains the same as it was on 200G Slingshot, so there's    still only 64 ports per switch, and the fabric scale limits don't increase. In addition, the media is saying that    Slingshot 400 (and the GB200 blade that will launch with it) won't start appearing until \"Fall        2025.\" Considering 64-port 800G switches (like NVIDIA's SN5600 and Arista's        7060X6) will have already been on the market by then though, Slingshot 400 will be launching with HPE Cray    on its back foot.However, there was a curious statement on the placard accompanying this Slingshot 400    switch:It reads, \"Ultra Ethernet is the future, HPE Slingshot delivers today!\"Does this suggest that Slingshot 400 is just a stopgap until 800G Ultra Ethernet NICs begin    appearing? 
If so, I look forward to seeing HPE Cray jam third-party 800G switch ASICs into the Cray EX liquid-cooled form factor at future SC conferences.Grace-Grace for storage?One of the weirder things I saw on the exhibit floor was a scale-out storage server built on NVIDIA Grace CPUs that the good folks at WEKA had on display at their booth. Manufactured by Supermicro, this \"ARS-121L-NE316R\" server (really rolls off the tongue) uses a two-socket Grace superchip and its LPDDR5X instead of conventional, socketed CPUs and DDR. The rest of it seems like a normal scale-out storage server, with sixteen E3.S SSD slots in the front and four 400G ConnectX-7 or BlueField-3 NICs in the back. No fancy dual-controller failover or anything like that; the presumption is that whatever storage system you'd install over this server would implement its own erasure coding across drives and servers. At a glance, this might seem like a neat idea for a compute-intensive storage system like WEKA or DAOS. However, one thing that you typically want in a storage server is high reliability and repairability, features which weren't the optimal design point for these Grace superchips. Specifically: The Grace-Grace superchip turns both CPU sockets into a single FRU. This means that if one CPU goes bad, you're shipping the whole board back to NVIDIA rather than just doing a field-swap of a socket. Grace uses LPDDR5X, whose ECC is not as robust as DDR5. I'm not an expert on memory architecture, but my understanding is that the ECC scheme on Grace does not provide ChipKill or protect against row failures. 
And as with CPU failure, if a single DRAM chip goes bad, you're throwing out two CPUs and all the DRAM. There's no way to value-engineer the exact quantity of cores, clock, and DRAM to be optimal for the storage software installed on top of these servers. On the upside, though, there might be a cost advantage to using this Grace-Grace server over a beefier AMD- or Intel-based server with a bunch of traditional DIMMs. And if you really like NVIDIA products, this lets you do NVIDIA storage servers to go with your NVIDIA network and NVIDIA compute. As long as your storage software can work with the interrupt rates of such a server (e.g., it supports rebuild-on-read) and the 144 Neoverse V2 cores are a good fit for its computational requirements (e.g., calculating complex erasure codes), this server makes sense. But building a parallel storage system on LPDDR5X still gives me the willies. I could also see this thing being useful for certain analytics workloads, especially those which may be upstream of LLM training. I look forward to hearing about where this turns up in the field.Microsoft and AMD's new HBM CPUThe last bit of new and exciting HPC technology that I noted came from my very own employer in the form of HBv5, a new, monster four-socket node featuring custom-designed AMD CPUs with HBM. 
STH wrote up an article with great photos of HBv5 and its speeds and feeds, but in brief, this single node has: 384 physical Zen 4 cores (352 accessible from within the VM) that clock up to 4 GHz; 512 GB of HBM3 (up to 450 GB accessible from the VM) with up to 6.9 TB/s STREAM bandwidth; 4x NDR InfiniBand NICs clocked at 200G per port; a 200G Azure Boost NIC (160G accessible from the VM); and 8x 1.84 TB NVMe SSDs with up to 50 GB/s read and 30 GB/s write bandwidth. The node itself looks kind of wacky as well, because there just isn't a lot on it: There are the obvious four sockets of AMD EPYC 9V64H, each with 96 physical cores and 128 GB of HBM3, and giant heat pipes on top of them since it's 100% air-cooled. But there's no DDR at all, no power converter board (the node is powered by a DC bus bar), and just a few flyover cables to connect the PCIe add-in-card cages. There is a separate fan board with just two pairs of power cables connecting to the motherboard, and that's really about it. The front end of the node shows its I/O capabilities which are similarly uncomplicated: There are four NDR InfiniBand cards (one localized to each socket) which are 400G-capable but cabled up at 200G, eight E1.S NVMe drives, and a brand-new dual-port Azure Boost 200G NIC. Here's a close-up of the right third of the node's front: This is the first time I've seen an Azure Boost NIC in a server, and it looks much better integrated than the previous-generation 100G Azure SmartNIC that put the FPGA and hard NIC on separate boards connected by a funny little pigtail. This older 100G SmartNIC with pigtail was also on display at the Microsoft booth in an ND MI300X v5 node: And finally, although I am no expert in this new node, I did hang around the people who are all week, and I repeatedly heard them answer the same few questions: Is this MI300C? It is if you want it to be. You can call it Sally if you want; I don't think it will care. 
But Microsoft calls it HBv5, and the processor name will show up as AMD EPYC 9V64H in /proc/cpuinfo.Is its InfiniBand 1x800 port, 2x400 ports, ...? There are four NDR InfiniBand HCA cards, and each card        has one full 400G NDR InfiniBand port. However, each port is only connected up to top-of-rack switching at 200G.        Each InfiniBand HCA hangs off of a different EPYC 9V64H socket so that any memory address can get to        InfiniBand without having to traverse Infinity Fabric. Running four ports of NDR InfiniBand at half speed is an        unusual configuration, but that's what's going on here.How can I buy this CPU? EPYC 9V64H are \"custom            AMD EPYC processors only available in Azure.\" This means the only way to access it is by provisioning an        HBv5 virtual machine in Azure.Amidst all the unrelenting news about new GPUs optimized for AI workloads, it was nice to see something new and    unique launched squarely for the benefit of traditional scientific computing workloads.The HPC industry overallNew technology announcements are always exciting, but one of the main reasons I attend        SC and ISC is to figure out the broader trends shaping the HPC industry. What concerns are top of mind for the        community, and what blind spots remain open across all the conversations happening during the week? Answering        these questions requires more than just walking the exhibit floor; it involves interpreting the subtext of the        discussions happening at panels and BOF sessions. However, identifying where the industry needs more information        or a clearer picture informs a lot of the public-facing talks and activities in which I participate throughout        the year.What I learned about the average SC technical program attendeeThe biggest realization that I confirmed this week is that the SC conference is not an HPC        conference; it is a scientific computing conference. 
I sat in a few sessions where the phrase \"HPC    workflows\" was clearly a stand-in for \"scientific workflows,\" and \"performance evaluation\" still really means \"MPI    and OpenMP profiling.\" I found myself listening to ideas or hearing about tools that were intellectually    interesting but ultimately not useful to me because they    were so entrenched in the traditions of applying HPC to scientific computing. Let's talk about a few ways in which    this manifested.People think sustainability and energy efficiency are the same thingTake, for example, the topic of sustainability. There were talks, panels, papers, and BOFs    that touched on the environmental impact of HPC throughout the week, but the vast majority of them really weren't    talking about sustainability at all; they were talking about energy efficiency. These talks often use the following    narrative:Energy use from datacenters is predicted to reach some ridiculous number by 2030We must create more energy-efficient algorithms, processors, and scheduling policiesHere is an idea we tested that reduced the energy consumption without impacting the performance of some        application or workflowSustainability achieved! Success!The problem with this approach is that it declares victory when energy consumption is    reduced. This is a great result if all you care about is spending less money on electricity for your supercomputer,    but it completely misses the much greater issue that the electricity required to power an HPC job is often generated    by burning fossil fuels, and that the carbon emissions that are directly attributable to HPC workloads are    contributing to global climate change. 
This blind spot was exemplified by this slide, presented during a talk titled \"Towards Sustainable Post-Exascale Leadership Computing\" at the Sustainable Supercomputing workshop: I've written about this before and I'll write about it again: FLOPS/Watt and PUE are not meaningful metrics by themselves when talking about sustainability. A PUE of 1.01 is not helpful if the datacenter that achieves it relies on burning coal for its power. Conversely, a PUE of 1.5 is not bad if all that electricity comes from a zero-carbon energy source. The biggest issue that I saw being reinforced at SC this year is that claims of \"sustainable HPC\" are accompanied by the subtext of \"as long as I can keep doing everything else the way I always have.\" There were glimmers of hope, though. Maciej Cytowski from Pawsey presented the opening talk at the Sustainable Supercomputing workshop, and he led with the right thing--he acknowledged that 60% of the fuel mix that powers Pawsey's supercomputers comes from burning fossil fuels: Rather than patting himself on the back at his low PUE, Dr. Cytowski described how they built their datacenter atop a large aquifer from which they draw water at 21°C and return it at 30°C to avoid using energy-intensive chillers. To further reduce the carbon impact of this water loop, Pawsey also installed over 200 kW of solar panels on its facility roof to power the water pumps. Given the fact that Pawsey cannot relocate to somewhere with a higher ratio of zero-carbon energy on account of its need to be physically near the Square Kilometer Array, Cytowski's talk felt like the most substantive discussion on sustainability in HPC that week. Most other talks and panels on the topic really wanted to equate \"sustainability\" to \"FLOPS per Watt\" and pretend like where one deploys a supercomputer is not a part of the sustainability discussion. 
The    reality is that, if the HPC industry wanted to take sustainability seriously, it would talk less about watts and    more about tons of CO2. Seeing as how the average watt of electricity in Tennessee produces 2.75x more carbon than a watt of electricity in Washington,    the actual environmental impact of fine-tuning Slurm scheduling or fiddling with CPU frequencies is meaningless when    compared to the benefits that would be gained by deploying that supercomputer next to a hydroelectric dam instead of    a coal-fired power plant.I say all this because there are parts of the HPC industry (namely, the part in which I work)    who are serious about sustainability. And those conversations go beyond simply building supercomputers in    places where energy is low-carbon (thereby reducing Scope 2 emissions). They    include holding suppliers to high standards on reducing the carbon impact of transporting people and material to    these data centers, reducing the carbon impact of all the excess packaging that accompanies components, and being    accountable for the impact of everything in the data center after it reaches end of life (termed Scope 3 emissions).The HPC community--or more precisely, the scientific computing community--is still married    to the idea that the location of a supercomputer is non-negotiable, and \"sustainability\" is a nice-to-have secondary    goal. I was    hoping that the sessions I attended on sustainability would approach this topic at a level where the    non-scientific HPC world has been living. Unfortunately, the discussion at SC24, which spanned workshops, BOFs, and    Green 500, remains largely stuck on the idea that PUE and FLOPS/Watt are the end-all sustainability metrics. 
Those    metrics are important, but there are global optimizations that have much greater effects on reducing the    environmental impact of the HPC industry.AI sessions are really scientific computing sessions about AIAnother area where \"HPC\" was revealed to really mean \"scientific computing\" was in the    topic of AI. I sat in on a few BOFs and panels around AI topics to get a feel for where this community is in    adopting AI for science, but again, I found the level of discourse to degrade to generic AI banter despite the best    efforts of panelists and moderators. For example, I sat in the \"Foundational Large Language Models for    High-Performance Computing\" BOF session, and Jeff Vetter very clearly defined what a \"foundational large language    model\" was at the outset so we could have a productive discussion about their applicability in HPC (or, really,    scientific computing):The panelists did a good job of outlining their positions. On the upside, LLMs are good for    performing source code conversion, documenting and validating code, and maximizing continuity in application codes    that get passed around as graduate students come and go. On the downside, they have a difficult time creating    efficient parallel code, and they struggle to debug parallel code. And that's probably where the BOF should have    stopped, because LLMs, as defined at the outset of the session, don't actually have a ton of applicability in    scientific computing. But as soon as the session opened up to audience questions, the session went off the rails.The first question was an extremely basic and nonspecific question: \"Is AI a bubble?\"It's fun to ask provocative questions to a panel of experts. I get it. But the question had    nothing to do with LLMs, any of the position statements presented by panelists, or even HPC or scientific computing.    
It turned a BOF on \"LLMs for HPC\" into a BOF that might as well have been titled \"Let's just talk about AI!\" A few    panelists tried to get things back on track by talking about the successes of surrogate models to simulate physical    processes, but this reduced the conversation to a point where \"LLMs\" really meant \"any AI model\" and \"HPC\" really    meant \"scientific simulations.\"Perhaps the most productive statement to come out of that panel was when Rio Yokota    asserted that \"we\" (the scientific community) should not train their own LLMs, because doing so would be    \"unproductive for science.\" But I, as well as anyone who understands the difference between LLMs and \"AI,\" already    knew that. And the people who don't understand the difference between an LLM and a surrogate model probably didn't    pick up on Dr. Yokota's statement, so I suspect the meaning of his contribution was completely lost.Walking out of that BOF (and, frankly, the other AI-themed BOFs and panels I attended), I    was disappointed at how superficial the conversation was. This isn't to say these AI sessions were objectively    bad; rather, I think it reflects the general state of understanding of AI amongst SC attendees. Or perhaps it    reflects the demographic that is drawn to these sorts of sessions. If the SC community is not ready to have a    meaningful discussion about AI in the context of HPC or scientific computing, attending BOFs with like-minded peers    is probably a good place to begin getting immersed.But what became clear to me this past week is that SC BOFs and panels with \"AI\" in their    title aren't really meant for practitioners of AI. 
They're meant for scientific computing people who are beginning    to dabble in AI.AI for operations is not yet real in scientific computingI was invited to sit on a BOF panel called \"Artificial Intelligence and Machine Learning    for HPC Workload Analysis\" following on a successful BOF in which I participated at ISC24. The broad intent was to    have a discussion around the tools, methods, and neat ideas that HPC practitioners have been using to better    understand workloads, and each of us panelists was tasked with talking about a project or idea we had in applying    AI/ML to improve some aspect of workloads.What emerged from us speakers' lightning talks is that applying AI for operations--in this    case, understanding user workloads--is nascent. Rather than talking about how we use AI to affect how we design or    operate supercomputers, all of us seemed to focus more on how we are collecting data and beginning to analyze that    data using ML techniques. And maybe that's OK, because AI won't ever do anything for workload characterization until    you have a solid grasp of the telemetry you can capture about those workloads in the first place.But when we opened the BOF up to discussion with all attendees, despite having a packed    room, the audience had very little to contribute. Our BOF lead, Kadidia Konaté, tried to pull discussion out of the    room from a couple of different fronts by asking what tools people were using, what challenges they were facing, and    things along those lines. However, it seemed to me that the majority of the audience was in that room as spectators;    they didn't know where to start applying AI towards understanding the operations of supercomputers. 
Folks attended    to find out the art of the possible, not talk about their own challenges.As such, the conversation wound up bubbling back up to the safety of traditional topics in    scientific computing--how is LDMS working out, how do you deal with data storage challenges of collecting telemetry,    and all the usual things that monitoring and telemetry folks worry about. It's easy to talk about the topics you    understand, and just as the LLM conversation reverted back to generic AI for science and the sustainability topic    reverted back to FLOPS/Watt, this topic of AI for operations reverted back to standard telemetry collection.Some are beginning to realize that HPC exists outside of scientific computingDespite the pervasive belief at SC24 that \"HPC\" and \"scientific computing\" are the same thing, there are early signs    that the leaders in the community are coming to terms with the reality that there is now a significant amount of    leadership HPC happening outside the scope of the conference. This was most prominent at the part of the Top500 BOF    where Erich Strohmaier typically discusses trends based on the latest publication of the list.In years past, Dr. Strohmaier's talk was full of statements that strongly implied that, if a supercomputer is not    listed on Top500, it simply does not exist. This year was different though: he acknowledged that El Capitan,    Frontier, and Aurora were \"the three exascale systems we are aware of,\" now being    clear that there is room for exascale systems to exist that simply never ran HPL, or never submitted HPL results to    Top500. 
He explicitly acknowledged again that China has stopped making any Top500 submissions, and although he    didn't name them outright, he spent a few minutes dancing around \"hyperscalers\" who have been deploying exascale-class    systems such as Meta's H100        clusters (2x24K H100), xAI's        Colossus (100K H100), and the full system behind Microsoft's Eagle (14K H100 is a \"tiny fraction\").Strohmaier did an interesting analysis that estimated the total power of the Top500 list's supercomputers so he could    compare it to industry buzz around hyperscalers building gigawatt-sized datacenters:It was a fun analysis where he concluded that there are between 500-600 megawatts of supercomputers on the Top500    list, and after you factor in storage, PUE, and other ancillary power sources, the whole Top500 list sums up to what    hyperscalers are talking about sticking into a single datacenter facility.Although he didn't say it outright, I think the implication here is that the Top500 list is rapidly losing relevance    in the broad HPC market, because a significant amount of the world's supercomputing capacity and capability    is absent from the list. Although specific hyperscale supercomputers (like Meta's, xAI's, and Microsoft's) were not    mentioned outright, their absence from the Top500 list suggests that this list might already be more incomplete than    it is complete--the sum of the FLOPS or power on the Top500 supercomputers may be less than the sum of the giant    supercomputers which are known but not listed. This will only get worse as the AI giants keep building systems every    year while the government is stuck on its 3-5 year procurement cycles.It follows that the meaning of the Top500 is sprinting towards a place where it is not representative of HPC so much    as it is representative of the slice of HPC that serves scientific computing. 
Erich Strohmaier was clearly    aware of this in his talk this year, and I look forward to seeing how the conversation around the Top500 list    continues to morph as the years go on.NSF's broad front vs. DOE's big bets in HPC and AIMy career was started at an NSF HPC center and built up over my years in the        DOE, so I feel like I owe a debt to the people who provided all the opportunities and mentorship that let me    get to the place of privilege in the hyperscale/AI industry that I now enjoy. As a result, I find myself still    spending a lot of my free time thinking about the role of governments in the changing face of        HPC (as evidenced by my critiques of thinktank reports and federal RFIs...) and trying to bridge the gap    in technical understanding between my old colleagues (in DOE, NSF, and European HPC organizations) and whatever they    call what I work on now (hyperscale AI?).To that end, I found myself doing quite a bit of business development (more on this later) with government    types since I think that is where I can    offer the most impact. I used to be government, and I closely follow the state of their thinking in HPC, but I also    know what's going on inside the hyperscale and AI world. I also have enough context in both areas to draw a line    through all the buzzy AI press releases to demonstrate how the momentum of private-sector investment in AI might    affect the way national HPC    efforts do business. 
So, I did a lot of talking to both my old colleagues in DOE and their industry partners in an    attempt to help them understand how the hyperscale and AI industry thinks about infrastructure, and what they should    expect in the next year.More importantly though, I also sat in on a couple of NSF-themed BOFs to get a better understanding of where their    thinking is, where NAIRR is going, how the NSF's strategy contrasts with DOE's strategy, and where the ambitions of    the Office of Advanced Cyberinfrastructure might intersect with the trajectory of hyperscale AI.What I learned was that NSF leadership is aware of everything that the community should be concerned about: the    growth of data, the increasing need for specialized silicon, the incursion of AI into scientific computing, new    business models and relationships with industry, and broadening the reach of HPC investments to be globally    competitive. But beyond that, I struggled to see a cohesive vision for the future of NSF-funded    supercomputing. A BOF with a broad range of stakeholders probably isn't the best place to lay out a vision for the future of NSF's    HPC efforts, and perhaps NSF's vision is best expressed through its funding opportunities and awards. Whichever the    case may be, it seems like the NSF remains on a path to make incremental progress on a broad front of topics. Its    Advanced Computing Systems and Services (ACSS) program will continue to fund the acquisition of newer    supercomputers, and a smorgasbord of other research programs will continue funding efforts across public access to    open science, cybersecurity, sustainable software, and other areas. My biggest concern is that peanut-buttering    funding across such a broad portfolio will make net forward progress much slower than taking big bets. 
Perhaps big    bets just aren't in the NSF's mission though.NAIRR was also a topic that came up in every NSF-themed session I attended, but again, I didn't get a clear picture    of the future. Most of the discussion that I heard was around socializing the resources that are available today    through NAIRR, suggesting that the pilot's biggest issue is not a lack of HPC resources donated by industry, but    awareness that NAIRR is a resource that researchers can use. This was reinforced by a survey whose results were    presented in the NAIRR BOF:It seems like the biggest challenge facing the NSF community relying on NAIRR (which has its own sample bias) is    that they don't really know where to start even though they have AI resources (both GPUs and model API services) at    their disposal. In a sense, this is a great position for the NSF since its users need intellectual help more than access to GPU resources, and the NSF has been great at promoting        education, training, and workforce development. Its users are unlikely to demand the same cutting-edge GPUs that AI industry leaders are snapping up. For        example, the largest pool of GPUs in NAIRR are A100 GPUs that NVIDIA donated via DGX Cloud; the big AI        companies moved off of Ampere a year ago and are about to move off of Hopper.However, it also means that there's not a clear role for partnership with many industry players beyond donating    resources to the NAIRR pilot today in the hopes of selling resources to the full NAIRR tomorrow. 
I asked what OAC    leadership thought about moving beyond such a transactional relationship between NSF and industry at one of the BOFs    I attended, and while the panelists were eager to explore specific answers to that question, I didn't hear any ideas    that would approach some sort of truly equitable partnership where both parties contributed in-kind.I also walked away from these NSF sessions struck by how different the NSF HPC community's culture is from that of    the DOE. NSF BOF attendees seemed focused on getting answers and guidance from NSF leadership, unlike the typical    DOE gathering, where discussions often revolve around attendees trying to shape priorities to align with their own    agendas. A room full of DOE people tends to feel like everyone thinks they're the smartest person there, while NSF    gatherings appear more diverse in the expertise and areas of depth of its constituents. Neither way is inherently    better or worse, but it will make the full ambition of NAIRR (as an inter-agency collaboration) challenging to    navigate. This is particularly relevant as DOE is now pursuing its own multi-billion-dollar AI infrastructure    effort, FASST, that appears to sidestep NAIRR.Exhibitor trendsThere's no better way to figure out what's going on in the HPC industry than walking the    exhibit floor each year, because booths cost money and reflect the priorities (and budgets) of all participants.    This year's exhibit felt physically huge, and walking from one end to the other was an adventure. 
You can get a    sense of the scale from this photo I took during the opening gala:Despite having almost 18,000 registrants and the opening gala usually being a crush of people, the gala this year felt and looked very sparse just because people and booths were more spread out.There was also a perceptibly larger number of splashy vendors who have historically never attended before who were promoting downstream HPC technologies like data center cooling and electrical distribution, and there was healthy speculation online about whether the hugeness of the exhibit this year was due to these new power and cooling companies.To put these questions to rest, I figured out how to yank down all the exhibitor metadata    from the conference website so I could do some basic analysis on it.Booths by the numbersThe easiest way to find the biggest companies to appear this year was to compare the    exhibitor list and booth sizes from SC23 to this year and see whose booth went from zero to some big square footage.I only took the top twenty new vendors, but they broadly fall into a couple of categories:Power and cooling: Stulz, Delta, Airedale, Valvoline, Boundary Electric, Schneider Electric, Mara        Server manufacturing: Wistron, AMI, PegatronHigher ed: Tennessee Tech, SCRCCThere were a couple other companies that must've just missed last SC but aren't new to        the show (NetApp, Ansys, Samsung, Micron, Broadcom). And curiously, only one new GPU-as-a-Service provider        (Nebius) showed up this year, suggesting last year was the year of the GPU Cloud.But to confirm what others had speculated: yes, a significant amount of the new square        footage of the exhibit floor can be attributed to companies focused on power and cooling. This is an interesting        indicator that HPC is becoming mainstream, largely thanks to AI demanding ultra-high density of power and        cooling. But it's also heartening to see a few new exhibitors in higher education making an appearance. 
Notably,        SCRCC (South Carolina Research Computing Consortium) is a consortium between Clemson, University of South        Carolina, and Savannah River National Laboratory that just formed last year, and I look forward to seeing what        their combined forces can bring to bear.We can also take a look at whose booths grew the most compared to SC23:This distribution is much more interesting, since the top 20 exhibitors who grew their footprint comprise the        majority of the growth in existing exhibitors. Cherry-picking a few interesting growers:Power and cooling: USystems, Midas, VertivData center/GPUaaS: iM, Iris Energy, and (arguably) OracleSoftware: Arc Compute and CIQCompanies facing serious financial or legal troubles: I count at least three! Impressive that they            are still pouring money into their SC booths.It's also interesting to see HLRS, the German national HPC center, grow so        significantly. I'm not sure what prompted such a great expansion, but I take it to mean that things have been        going well there.Finally, Dell had a massive booth and showing this year. Not only did they grow the        most since SC23, but they had the single largest booth on the exhibit floor at SC24. This was no doubt a result        of their great successes in partnering with NVIDIA to land massive GPU buildout deals at places like xAI and CoreWeave.        They also had \"AI factory\" messaging emblazoned all over their marketing material and debuted a nice 200 kW        liquid-cooled rack that will be the basis for their GB200 NVL72 solution, clearly leaning into the idea that        they are leaders in AI infrastructure. 
Despite this messaging being off-beat for the SC audience as I've        described earlier, their booth was surprisingly full all the time, and I didn't actually get a chance to get in        there to talk to anyone about what they've been doing.Equally interesting are the vendors who reduced their footprint at SC24 relative to        SC23:Reading too much into any of these big shrinkers is pretty easy; while a reduction in        booth size could suggest business hasn't been as good, it could equally mean that an exhibitor just went        overboard at SC23 and downsized to correct this year. A few noteworthy exhibitors to call out:Penguin and the Korea Semiconductor Industry Association both cut way back from massive 50x50 booths to            30x30. Their booths this year were both big, but they weren't massive. Viridien, formerly known as CGG, also            shrank from a massive booth to a less-massive 30x40.Juniper still kept an independent booth, but it is in the process                of being absorbed into HPE. Shrinking makes sense.Major cloud providers Google and AWS scaled back, but Microsoft did not.GPU-as-a-Service cloud providers CoreWeave and Lambda both scaled back. Since these GPUaaS providers'            business models typically rely on courting a few big customers, it may make sense to cut back on booth volume.        Major AI storage companies DDN, VAST, and (to a lesser degree) Pure also scaled back, while WEKA did not. 
I            know business for DDN and VAST has been great this past year, so these may just reflect having gone            overboard last year.Overall, almost twice as many vendors grew their booths as scaled back, so I'd        caution anyone against trying to interpret any of this as anything beyond exhibitors right-sizing their booths        after going all-in last year.Finally, there are a handful of vendors who disappeared outright after SC23:It is critical to point out that the largest booths to vanish outright were all on the        smaller side: SUSE, Tenstorrent, and Symbiosys Alliance all disappeared this year, but their booths last year        were only 20x30. I was surprised to see that Tenstorrent and Arm didn't have booths, but the others are either        companies I haven't heard of (suggesting the return on investment of showing at SC might've been low), are easy        to rationalize as only being HPC-adjacent (such as SNIA and DigitalOcean), or simply went bankrupt in the last        year.As we say at the business factory, the net-net of the exhibit hall this year is that        the square footage of booth space increased by 15,000 square feet, so it was in fact bigger, it did take longer        to walk from one end to the other, and there definitely were a bunch of new power and cooling companies filling        out the space. Some exhibitors shrank or vanished, but the industry as a whole appears to be moving in a healthy        direction.And if you're interested in analyzing this data more yourself, please have a look at the data and the Jupyter notebook I used to generate            the above treemaps on GitHub. If you discover anything interesting, please write about it and post it        online!Proliferation of GPU-as-a-Service providersAs an AI infrastructure person working for a major cloud provider, I kept an eye out for all the companies trying        to get into the GPU-as-a-Service game. 
I described these players last year as            \"pure-play GPU clouds,\" and it seems like the number of options available to customers who want to go        this route is growing. But I found it telling that a lot of them had booths that were completely        indistinguishable from each other. Here's an example of one:As best I can tell, these companies are all NVIDIA preferred partners with    data centers and a willingness to deploy NVIDIA GPUs, NVIDIA SmartNICs, and NVIDIA cloud stack, and sell multi-year    commitments to consume those GPUs. I tried to accost some of these companies' booth staff to ask them my favorite    question (\"What makes you different from everyone else?\"), but most of these companies' booths were staffed by    people more interested in talking to each other than me.These GPUaaS providers tend to freak me out, because, as Microsoft's CEO recently stated, these companies are        often \"just a bunch of            tech companies still using VC money to buy a bunch of GPUs.\" I can't help but feel like this is where        the AI hype will come back to bite companies who have chosen to build houses upon sand. Walking the SC24 exhibit        floor is admittedly a very narrow view of this line of business, but it seemed like some of these companies were        content to buy up huge booths, hang a pretty banner above it, and otherwise leave the booth empty of anything        beyond a few chairs and some generic value propositions. I didn't feel a lot of hunger or enthusiasm from these        companies despite the fact that a bunch of them have hundreds of millions of dollars of GPUs effectively sitting        on credit cards that they are going to have to make payments on for the next five years.That all said, not all the companies in the GPUaaS market are kicking back and letting the money pour in. 
In particular,        I spent a few minutes chatting up someone at the CoreWeave booth, and I was surprised to hear about how much        innovation they're adding on top of their conventional GPUaaS offering. For example, they developed Slurm on Kubernetes            (SUNK) with one of their key customers to close the gap between the fact that CoreWeave exposes its GPU        service through Kubernetes, but many AI customers have built their stack around Slurm, pyxis, and enroot.    In a weird twist of fate, I later ran into an old acquaintance who turned out to be one of the key CoreWeave        customers for whom SUNK was developed. He commented that SUNK is the real deal and does exactly what his users        need which, given the high standards that this person has historically had, is a strong affirmation that SUNK is        more than just toy software that was developed and thrown on to GitHub for an easy press release. CoreWeave is        also developing some interesting high-performance object storage caching software, and all of these software        services are provided at no cost above whatever customers are already paying for their GPU service.I bring this up because it highlights an emerging distinction in the GPUaaS market, which used to be a homogenous        sea of bitcoin-turned-AI providers. Of course, many companies still rely on that simple business model: holding        the bill for rapidly depreciating GPUs that NVIDIA sells and AI startups consume. However, there are now GPUaaS        providers moving up the value chain by taking on the automation and engineering challenges that model developers        don't want to deal with. Investing in uncertain projects like new software or diverse technology stacks is        certainly risky, especially since they may never result in enough revenue to pay for themselves. But having a        strong point of view, taking a stance, and investing in projects that you feel are right deserves recognition.    
    My hat is off to the GPUaaS providers who are willing to take these risks and raise the tide for all of us        rather than simply sling NVIDIA GPUs to anyone with a bag of money.Community and connectionsAs much as I enjoy increasing shareholder value, the part of SC that gives me the    greatest joy is reconnecting with the HPC community. Knowing I'll get to chat with my favorite people in the    industry (and meet some new favorite people!) makes the long plane rides, upper respiratory infections, and weird    hotel rooms completely worth it.I wound up averaging under six hours of sleep per night this year in large part because 9pm    or 7am were often the only free times I had to meet with people I really wanted to see. I have this unhealthy    mindset where every hour of every day, from the day I land to the day I leave, is too precious to waste, and it's    far too easy for me to rationalize that spending an hour talking to someone interesting is worth losing an hour of    sleep.But like I said at the outset of this blog post, this year felt different for a few    reasons, and a lot of them revolve around the fact that I think I'm getting old. Now, it's always fun to say \"I'm    getting old\" in a mostly braggadocious way, but this feeling manifested in concrete ways that affected the way I    experienced the conference:I hit my limit on Monday night and couldn't get home without spending 15 minutes sitting in an unlit playground        across from the World of Coke. I've always gotten blisters and fatigue, but this was the first time I couldn't        just cowboy up and muscle through it. To avoid a repeat of this, I wound up \"wasting\" (see above) a lot more        time to just get off my feet this year.This year, I reached the point where I need to start time-boxing how much time I spend chatting up the folks I        bump into. 
I used to just let the good times roll if I ran into someone I knew, but this year I wound up        spending as much time attending sessions as I did missing sessions because I got caught up in a conversation.        This isn't a bad thing per se, but I did feel a little sour when I realized I'd made a bad bet on choosing to        chat instead of attending a session or vice versa, and this bad feeling lingered in the back of my mind just        about every day.There weren't a lot of surprises for me at the conference this year, and I worry that I am at risk of losing        touch with the technical aspects of the conference that get newer attendees excited. Instead of hearing about,        say, the latest research in interconnects, more of my time was spent mucking it up with the sorts of people in        the HPC community who I used to find intimidating. On the one hand, hooray me for making it into old boys'        clubs. But on the other, I don't want to become some HPC greybeard whose last meaningful contribution to the        industry was twenty years ago.This is the first year where I've had people accost me and ask me for advice. I've long been accosted by        strangers because of my online presence, but those interactions were always lighthearted exchanges of \"I follow        you on Twitter\" and \"Great to meet you. Have an @HPC_Guru pin.\" This year, I had people specifically ask me for        advice on industry versus postdoc, AI versus HPC, and what my master plan was when I left NERSC. Even though I        didn't have any sage advice, I still found it really hard to tell bright-eyed students to go kick rocks just so        I wouldn't be late for yet another mushy panel on AI.If you read this all and think \"boo hoo, poor Glenn is too popular and wise for his own    good,\" yeah, I get it. There are worse problems to have. But this was the first year where I felt like what I put    into the conference was greater than what I got out of it. 
Presenting at SC used to be at least as good for my    career as it was useful for my audiences, but it just doesn't count for much given my current role and career stage.    It felt like some of the magic was gone this year in a way I've never experienced before. Getting to know peopleAs the years have gone on, I spend an increasing amount of my week having one-on-one    conversations instead of wandering aimlessly. This year though, I came to SC without really having anything to buy    or sell:I am not a researcher, so I don't need to pump up the work I'm doing to impress my fellow researchers.I no longer own a product market segment, so I don't directly influence the customers or vendors with whom my        employer works.I don't have any bandwidth in my day job to support any new customers or partnerships, so I don't have a strong        reason to sell people on partnering with me or my employer. Much to my surprise though, a bunch of my old vendor/partner colleagues still wanted to get    together to chat this year. Reflecting back, I was surprised to realize that it was these conversations--not the    ones about business--that were the most fulfilling this year.I learned about people's hobbies, families, and their philosophies on life, and it was    amazing to get to know some of the people behind the companies with whom I've long dealt. I was reminded that the    person is rarely the same as the company, and even behind some of the most aggressive and blusterous tech companies    are often normal people with the same concerns and moments of self-doubt that everyone else has. I was also reminded    that good engineers appreciate good engineering regardless of whether it's coming from a competitor or not. 
The    public persona of a tech exec may not openly admire a competitor's product, but that doesn't mean they don't know    good work when they see it.I also surprised a colleague whose career has been in the DOE labs with an anecdote that    amounted to the following: even though two companies may be in fierce competition, the people who work for them    don't have to be. The HPC community is small enough that almost everyone has got a pal at a competing company, and    when there are deals to be made, people looove to gossip. If one salesperson hears a juicy rumor about a prospective    customer, odds are that everyone else on the market will hear about it pretty quickly too. Of course, the boundaries    of confidentiality and professionalism are respected when it matters, but the interpersonal relationships that are    formed between coworkers and friends don't suddenly disappear when people change jobs.And so, I guess it would make sense that people still want to talk to me even though I have    nothing to buy or sell. I love trading gossip just as much as everyone else, and I really enjoyed this aspect of the    week.Talking to early career peopleI also spent an atypically significant amount of my week talking to early career people in    HPC who knew of me one way or another and wanted career advice. This is the first year I recall having the same    career conversations with multiple people, and this new phase of my life was perhaps most apparent during the IEEE    TCHPC/TCPP HPCSC career panel in which I was invited to speak this year.It was an honor to be asked to present on a career panel, but I didn't feel very qualified to give career advice to    up-and-coming computer science graduate students who want to pursue HPC. I am neither a computer scientist nor a    researcher, but fortunately for me, my distinguished co-panelists (Drs. Dewi Yokelson, Olga Pearce, YJ Ji, and    Rabab Alomairy) had plenty of more relevant wisdom to share. 
And at the end of the panel, there were a few things we    all seemed to agree on as good advice:Knowing stuff is good, but being able to learn things is better. Being eager to learn and naturally curious        makes this much easier as well.The life of a researcher sometimes requires more than working a standard nine-to-five, so it'll be hard to be        really successful if your heart isn't in it.People will forget what you did or what you said,            but they remember how you made them feel. Don't be a jerk, because this community is small.In both this panel and the one-on-one conversations I had with early career individuals, the best I could offer was the    truth: I never had a master plan that got me to where I am; I just try out new things until I realize I don't like    doing them anymore. I never knew what I wanted to be when I grew up, and I still don't really, so it now makes me    nervous that people have started approaching me with the assumption that I've got it all figured out. Unless I    torpedo my career and go live on a goat farm though, maybe I should prepare for this to be a significant part of my    SC experiences going forward.Shift in social mediaOne last, big change in the community aspect of SC this year was the mass-migration of a ton of HPC folks from    Twitter to Bluesky during the week prior to the conference. I don't really understand what prompted it so suddenly;    a few of us have been trying for years to get some kind of momentum on other social platforms like Mastodon, but the    general lack of engagement meant that all the excitement around SC always wound up exclusively on Twitter. 
This year    was different though, and Bluesky hit critical mass with the HPC community.I personally have never experienced an SC conference without Twitter; my first SC was in 2013, and part of what made    that first conference so exciting was being able to pull up my phone and see what other people were seeing,    thinking, and doing across the entire convention center via Twitter. Having the social media component to the    conference made me feel like I was a part of something that first year, and as the years went on, Twitter became an    increasingly indispensable part of the complete SC experience for me.This year, though, I decided to try an        experiment and see what SC would be like if I set Twitter aside and invested my time into Bluesky instead.The verdict? It was actually pretty nice.It felt a lot like the SC13 days, where my day ended and began with me popping open Bluesky to see what new #SC24 posts were made. And because many of the tech companies and HPC    centers hadn't yet made it over, the hashtag wasn't clogged up by a bunch of prescheduled marketing blasts that    buried the posts written by regular old conference attendees who were asking important questions:Which booths at #sc24 have coffee? I noticed oracle do. Anyone else?— Mike Croucher (@walkingrandomly.bsky.social) November 18, 2024 at 3:02 PMOf course, I still clogged Bluesky up with my nonsense during the week, but there was an amazing amount of    engagement by a diversity of thoughtful people--many who came from Twitter, but some whose names and handles I    didn't recognize.The volume of traffic on Bluesky during the week did feel a little lower than what it had been on Twitter in years    past though. I also didn't see as many live posts of technical sessions as they happened, so I couldn't really tell    whether I was missing something interesting in real time. 
This may have contributed to why I felt a little less    connected to the pulse of the conference this year than I had in the past. It also could've been the fact that    the conference was physically smeared out across a massive space though; the sparsity of the convention center was at    least on par with the sparsity on Bluesky.At the end of the week, I didn't regret the experiment. In fact, I'll probably be putting more effort into my Bluesky    account than my Twitter account going forward. To be clear though, this isn't a particularly political decision on    my part, and I pass no judgment on anyone who wants to use one platform over the other. It's just that I like the    way I feel when I scroll through my Bluesky feeds, and I don't get that same feeling when I use Twitter.So what's the takeaway?SC this year was a great conference by almost every measure, as it always is, but it still felt a little different for me. I'm sure that some of that feeling is the result of my own growth, and my role with respect to the conference seems to be evolving from someone who gets a lot out of the conference to someone who is giving more to the conference. That's not to say that I don't get a lot out of it, though; I had no shortage of wonderful interactions with everyone from technology executives to rising stars who are early in their career, and I learned a lot about both them and me as whole people. But SC24, more than any SC before it, is when I realized this change was happening.On the technological front, we saw the debut of a new #1 system (emblazoned with the smiling face of Bronis...) and a growing crop of massive, new clusters deployed for commercial applications. The exhibit floor was quantitatively bigger, in large part due to new power and cooling companies who are suddenly relevant to the HPC world thanks to the momentum of AI. 
At the same time, the SC technical program is clearly separating itself out as a conference focused on scientific computing; the level of discourse around AI remains largely superficial compared to true AI conferences, and the role of hyperscalers in the HPC industry is still cast more as a threat than an opportunity.For my part, I'm still trying to get a grasp on where government agencies like DOE and NSF want to take their AI ambitions so I can try to help build a better mutual understanding between the scientific computing community and the hyperscale AI community. However, it seems like the NSF is progressing slowly on a wide front, while the DOE is doing what DOE does and charging headfirst into a landscape that has changed more than I think they realize.There's a lot of technical content that I know I missed on account of the increasing time I've been spending on the people and community aspect of the conference, and I'm coming to terms with the idea that this just may be the way SC is from now on. And I think I'm okay with that, since the support of the community is what helped me go from being a bored materials science student to someone whose HPC career advice is worth soliciting in the short span of eleven years. Despite any or all of the cynicism that may come out in the things I say about this conference, SC is always the highlight of my year. I always go into it with excitement, gladly burn the candle at both ends all week, and fly home feeling both grateful for and humbled by everything the HPC community has done and continues to do to keep getting me out of bed in the morning.",
            "content_html": "<p>The premiere annual conference of the high-performance computing community, SC24, was held in Atlanta last week, and    it attracted a record-shattering number of attendees--<a href=\"https://www.hpcwire.com/2024/11/20/sc24-half-way-there/\">nearly 18,000 registrants</a>, up 28% from last    year! The conference <i>felt</i> big as well, and there seemed to be a lot more running between sessions, meetings,    and the exhibition floor. Despite its objectively bigger size though, the content of the conference felt more diffuse this year, and I was left wondering if this reflected my own biases or was a real effect of the AI industry    beginning to overflow into AI-adjacent technology conferences like SC.</p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure></figure></div><p></p><p>Of course, this isn't to say that SC24 was anything short of a great conference. Some exciting new technologies were    announced, a new supercomputer beat out Frontier to become the fastest supercomputer on the Top500 list, and I got    to catch up with a bunch of great people that I only get to see at shows like this. I'll touch on all of these    things below. But this year felt different from previous SC conferences to me, and I'll try to talk about that too.</p><p>There's no great way to arrange all the things I jotted down in my notes, but I've tried to arrange them by what readers may be interested in. 
Here's the table of contents:</p><p></p><ol><li><a href=\"https://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#approach\">My approach to SC this year</a></li><li><a href=\"https://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#tech\">New technology and announcements</a><ol><li><a href=\"https://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#tech-top500\">Top500 and a new #1 system</a><ol><li><a href=\"https://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#tech-top500-elcap\">#1 - El Capitan</a></li><li><a href=\"https://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#tech-top500-hpc6\">#5 - Eni HPC6</a></li><li><a href=\"https://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#tech-top500-softbank\">#16 and #17 - SoftBank CHIE-2 and CHIE-3</a></li><li><a href=\"https://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#tech-top500-jeti\">#18 - Jülich's JUPITER Exascale Transition Instrument (JETI)</a></li><li><a href=\"https://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#tech-top500-reindeer\">#32 - Reindeer!</a></li></ol></li><li><a href=\"https://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#tech-expo\">Technology on the exhibit floor</a><ol><li><a href=\"https://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#tech-expo-gb200\">GB200</a></li><li><a href=\"https://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#tech-expo-ss400\">Slingshot 400</a></li><li><a href=\"https://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#tech-expo-gg\">Grace-Grace for storage?</a></li><li><a href=\"https://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#tech-expo-hbv5\">Microsoft and AMD's new HBM CPU</a></li></ol></li></ol></li><li><a href=\"https://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#industry\">The HPC industry overall</a><ol><li><a 
href=\"https://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#industry-attendee\">What I learned about the average SC technical program attendee</a><ol><li><a href=\"https://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#industry-attendee-sustainability\">People think sustainability and energy efficiency are the same thing</a></li><li><a href=\"https://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#industry-attendee-ai\">AI sessions are really scientific computing sessions about AI</a></li><li><a href=\"https://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#industry-attendee-ops\">AI for operations is not yet real in scientific computing</a></li><li><a href=\"https://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#industry-attendee-hyperscale\">Some are beginning to realize that HPC exists outside of scientific computing</a></li></ol></li><li><a href=\"https://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#industry-nsf\">NSF's broad front vs. 
DOE's big bets in HPC and AI</a></li><li><a href=\"https://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#industry-expo\">Exhibitor trends</a><ol><li><a href=\"https://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#industry-expo-booths\">Booths by the numbers</a></li><li><a href=\"https://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#industry-expo-gpuaas\">Proliferation of GPU-as-a-Service providers</a></li></ol></li></ol></li><li><a href=\"https://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#community\">Community and connections</a><ol><li><a href=\"https://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#community-people\">Getting to know people</a></li><li><a href=\"https://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#community-career\">Talking to early career people</a></li><li><a href=\"https://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#community-bsky\">Shift in social media</a></li></ol></li><li><a href=\"https://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#conclusion\">So what's the takeaway?</a></li></ol><p>Before getting into the details though, I should explain how my perspective shaped what I noticed (and missed) through the conference. And to be clear: <b><i><span style=\"color: #cc0000;\">these are my own personal opinions and do not necessarily reflect those of my employer</span></i></b>. 
Although Microsoft covered the cost for me to attend SC, I wrote this blog post during my own free time over the Thanksgiving holiday, and nobody had any editorial control over what follows except me.</p><p></p><h2 id=\"approach\">My approach to SC this year</h2><p>Although this is the eleventh SC conference I've attended, it was the first time that I:</p><p></p><ol><li>attended as a <a href=\"https://blog.glennklockwood.com/2024/08/how-has-life-after-leaving-labs-been.html#hpc-ai-development\">practitioner            of hyperscale AI</a> rather than traditional HPC and scientific computing</li><li>attended as a Microsoft engineer (I represented Microsoft as a <a href=\"https://blog.glennklockwood.com/2024/08/how-has-life-after-leaving-labs-been.html#storage-product-management\">product manager</a> at        SC22 and SC23)</li><li>did not attend SC as a designated storage person (since 2013)</li></ol><p>Because of these changes in my <b><span style=\"color: #990000;\">identity</span></b> as an attendee, I approached the    conference with a different set of <b><span style=\"color: #0b5394;\">goals</span></b> in mind:</p><p>As a <b><span style=\"color: #990000;\">hyperscale/AI person</span></b>, I felt that I should    prioritize <b><span style=\"color: #0b5394;\">attending all the cloud and AI sessions</span></b> whenever forced to choose between one session or another. I chose to focus on understanding the traditional HPC community's understanding of hyperscale and AI, which meant I had to spend less time in the workshops, panels and BOFs where I built my career.</p><p>As an <b><span style=\"color: #990000;\">engineer</span></b> rather than a product manager,    it wasn't my primary responsibility to run private briefings and gather HPC customers' requirements and feedback. Instead, I prioritized only those meetings where my first-hand    knowledge of how massive-scale AI training works could have a meaningful impact. 
This meant I <b><span style=\"color: #0b5394;\">focused on partners and practitioners who also operate in the realm of            hyperscale</span></b>--think massive, AI-adjacent companies and the HPC centers who have historically    dominated the very top of the Top500 list.</p><p>One thing I didn't anticipate going into SC24 is that I've inherited a third identity: there is a new cohort of people in HPC who see me as a <b><span style=\"color: #990000;\">long-time community            member</span></b>. This resulted in a surprising amount of my time being spent <b><span style=\"color: #0b5394;\">talking to students and early career practitioners</span></b> who were looking    for advice.</p><p>These three identities and goals meant I don't have many notes to share on the technical program, but I did capture more observations about broader trends in the HPC industry and community.</p><h2 id=\"tech\">New technology and announcements</h2><div>HPC is all about cutting-edge technology, so that's a fine place to start talking about what was new.</div><h3 id=\"tech-top500\">Top500 and a new #1 system</h3><p>A cornerstone of every SC conference is the release of the new Top500 list on Monday, and    this is especially true in years when a new #1 supercomputer is announced. As was widely anticipated in the weeks    leading up to SC24, El Capitan unseated Frontier as the new #1 supercomputer this year, posting an impressive <a href=\"https://www.top500.org/system/180307/\">1.74 EFLOPS</a> of FP64. In addition though, Frontier grew a    little (it added 400 nodes), there was a notable new #5 system (Eni's HPC6), and a number of smaller systems appeared that are worth calling    out.</p><h4 id=\"tech-top500-elcap\">#1 - El Capitan</h4><p>The highlight of the Top500 list was undoubtedly the debut of El Capitan, Lawrence    Livermore National Laboratory's massive new MI300A-based exascale supercomputer. 
Its 1.74 EF score resulted from a    105-minute HPL run that came in under 30 MW, and a bunch of technical details about the system were disclosed by    Livermore Computing's CTO, Bronis de Supinski, during an invited talk during the Top500 BOF. Plenty of others    summarize the system's speeds and feeds (e.g., see <a href=\"https://www.nextplatform.com/2024/11/18/el-capitan-supercomputer-blazes-the-trail-for-converged-cpu-gpu-compute/\">The        Next Platform's article on El Cap</a>), so I won't do that. However, I will comment on how unusual Bronis' talk    was.</p><p>Foremost, the El Capitan talk seemed haphazard and last-minute. Considering the system took over half a decade of planning and cost at least half a    billion dollars, El Capitan's unveiling was the most unenthusiastic description of a brand-new #1 supercomputer I've    ever seen. I can understand that the Livermore folks have debuted plenty of novel #1 systems in their careers, but El    Capitan is objectively a fascinating system, and running a full-system job for nearly two hours across first-of-a-kind APUs    is an amazing feat. If community leaders don't get excited about their own groundbreaking achievements, what kind of message should the next generation of HPC professionals take home?</p><p>In sharp contrast to the blasé announcement of this new system was the leading slide that was presented to describe the speeds and feeds of El Capitan:</p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure></figure></div><p></p><p>I've never seen a speaker take the main stage and put <i>a photo of himself</i> literally in the center of the slide, in front of the supercomputer they're talking about. 
I don't know what the communications people at Livermore were trying to do with this graphic, but I don't think it    was intended to be evocative of the first thing that came to my mind:</p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure></figure></div><p></p><p>The supercomputer is literally named \"The Captain,\" and there's a photo of one dude (the boss of Livermore Computing,    who is also standing on stage giving the talk) blocking the view of the machine. It wasn't a great look, and it left me feeling very uneasy about what I was witnessing and what message it was sending to the HPC community.</p><p>In case it needs to be said, HPC is a team sport. The unveiling of El Capitan (or any other #1 system    before it) is always the product of dozens, if not hundreds, of people devoting years of their professional lives to    ensuring it all comes together. It was a big miss, both to those who put in the work, and those who will have    to put in the work on future systems, to suggest that a single, smiling face comes before the success of the system deployment.</p><h4 id=\"tech-top500-hpc6\">#5 - Eni HPC6</h4><p>The other notable entrant to the Top 10 list was HPC6, an industry system deployed by Eni (a major Italian energy    company) built on MI250X. Oil and gas companies tend to be conservative in the systems they buy since the seismic    imaging done on their large supercomputers informs hundred-million to billion-dollar investments in drilling a new    well, and they have much less tolerance for weird architectures than federally funded leadership computing does.    
Thus, Eni's adoption of AMD GPUs in this #5 system is a strong endorsement of their capability in mission-critical    commercial computing.</p><h4 id=\"tech-top500-softbank\">#16 and #17 - SoftBank CHIE-2 and CHIE-3</h4><p>SoftBank, the Japanese investment conglomerate who, among other things, owns a significant stake in Arm, made its <a href=\"https://www.top500.org/site/51045/\">Top500 debut with two identical 256-node DGX H100 SuperPODs</a>. While    not technologically interesting (H100 is getting old), these systems represent a significant investment in HPC by    private industry in Japan and signal that SoftBank is following the lead of large <a href=\"https://www.nytimes.com/2023/08/16/technology/ai-gpu-chips-shortage.html\">American investment groups in        building private AI clusters for the AI startups in their portfolios</a>. In doing this, SoftBank's investments    aren't dependent on third-party cloud providers to supply the GPUs to make these startups successful, which reduces    SoftBank's overall risk.</p><p>Although I didn't hear anything about these SoftBank systems at the conference, NVIDIA issued a press statement    at the NVIDIA AI Summit Japan during the week prior to SC24 that discussed <a href=\"https://nvidianews.nvidia.com/news/nvidia-and-softbank-accelerate-japans-journey-to-global-ai-powerhouse\">SoftBank's        investment in large NVIDIA supercomputers</a>. The statement says that these systems will be used \"for    [SoftBank's] own generative AI development and AI-related business, as well as that of universities, research    institutions and businesses throughout Japan.\" The release also suggests we can expect B200 and GB200 SuperPODs from    SoftBank to appear as those technologies come online.</p><h4 id=\"tech-top500-jeti\">#18 - Jülich's JUPITER Exascale Transition Instrument (JETI)</h4><p>Just below the SoftBank systems was the precursor system to Europe's first exascale system. 
I was hoping that    JUPITER, the full exascale system being deployed at FZJ, would appear in the Top 10, but it seems like we'll have to    wait for ISC25 for that. Still, the JETI system ran HPL across 480 nodes of BullSequana XH3000, the same node that    will be used in JUPITER, and achieved 83 PFLOPS. By comparison, the full JUPITER system will be over 10x larger (\"<a href=\"https://www.fz-juelich.de/en/ias/jsc/jupiter/tech\">roughly 6000 compute nodes</a>\" in the Booster), and    projecting the JETI run (173 TF/node) out to this full JUPITER scale indicates that JUPITER should just squeak over    the 1.0 EFLOPS line.</p><p>In preparation for JUPITER, Eviden had a couple of these BullSequana XH3000 nodes out on display this year:</p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure></figure></div><p></p><p>And if you're interested in more, I've been tracking the technical details of <a href=\"https://glennklockwood.com/garden/systems/jupiter\">JUPITER in my digital garden</a>.</p><h4 id=\"tech-top500-reindeer\">#32 - Reindeer!</h4><p>Waay down the list was Microsoft's sole new Top500 entry this cycle, an NVIDIA H200 system that ran HPL over 120 ND    H200 v5 nodes in Azure. It was one of only two conventional (non-Grace) H200 clusters that appeared in the top 100,    and it had a pretty good efficiency (Rmax/Rpeak &gt; 80%). Microsoft also had a Reindeer node on display at its    booth:</p><p></p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure></figure></div><p></p><p>An astute observer may note that this node looks an awful lot like the H100 node used in its Eagle supercomputer,    which was <a href=\"https://blog.glennklockwood.com/2023/11/sc23-recap.html\">on display at SC23 last year</a>. 
That's    because it's the same chassis, just with an upgraded HGX baseboard.</p><p>Reindeer was not <i>super</i> exciting, and there were no press releases about it, but I mention it here for a couple    reasons:</p><p></p><ul><li>One of my teammates did the HPL run and submission, and his group got to come up with the name of the system for        the purposes of HPL. As it turns out, generating a public name for a Top500 submission involves a comical amount        of legal and marketing process when it comes from a giant corporation like Microsoft. And as it turns out,        naming a cluster \"Reindeer\" has a low probability of offending anyone.</li><li>Reindeer is pretty boring--it's a relatively small cluster with a bunch of GPUs. But when you're building out AI        infrastructure at a pace of <a href=\"https://build.microsoft.com/en-US/sessions/984ca69a-ffca-4729-bf72-72ea0cd8a5db\">5x Eagles (70,000            GPUs!) per month</a>, you want the clusters that those GPUs go into to be as boring, predictable, and        automatable as possible. Seeing as how Reindeer only used 960 GPUs but still got #32, it doesn't require much        math to realize that the big hyperscalers could flood the Top500 list with these cookie-cutter GPU clusters and        (in this case) make any ranking below #32 completely irrelevant. Heaven help the Top500 list if they ever        publish an API for submitting new systems; cloud providers' build validation automation could tack a Top500        submission on at the end of burn-in and permanently ruin the list.</li></ul><div>On a personal note, the supercomputer grant that gave me my first job in the HPC business <a href=\"https://www.top500.org/system/177455/\">debuted at #48</a>. 
It's mind-boggling that I now work in a place    where standing up a #32 system is just day-to-day business.</div><p></p><h3 id=\"tech-expo\">Technology on the exhibit floor</h3><p>The exhibit floor had a few new pieces of HPC technology on display this year that are    worthy of mention, but a lot of the most HPC-centric exciting stuff actually had a soft debut at <a href=\"https://blog.glennklockwood.com/2024/05/isc24-recap.html\">ISC24 in May</a>. For example, even though SC24 was MI300A's big splash due to    the El Capitan announcement, some MI300A nodes (such as the <a href=\"https://glennklockwood.com/garden/nodes/cray-ex255a\">Cray EX255a</a>) were on display in Hamburg. However,    Eviden had their MI300A node (branded XH3406-3) on display at SC24 which was new to me:</p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure></figure></div><p></p><p>I'm unaware of anyone who's actually committed to a large Eviden MI300A system, so I was    surprised to see that Eviden already has a full blade design. But as with Eni's HPC6 supercomputer, perhaps this is    a sign that AMD's GPUs (and now APUs) have graduated from being built-to-order science experiments to a technology    ecosystem that people will want to buy off the rack.</p><p>There was also a ton of GH200 on the exhibit hall floor, but again, these node types were    also on display at ISC24. This wasn't a surprise since a bunch of upcoming European systems have invested in GH200    already; in addition to JUPITER's 6,000 GH200 nodes described above, <a href=\"https://www.cscs.ch/computers/alps\">CSCS Alps</a> has 2,688 GH200 nodes, and <a href=\"https://glennklockwood.com/garden/systems/isambard-ai\">Bristol's Isambard-AI</a> will have 1,362 GH200    nodes. All of these systems will have a 1:1 CPU:GPU ratio and an NVL4 domain, suggesting this is the optimal way to    configure GH200 for HPC workloads. 
I didn't hear a single mention of GH200 NVL32.</p><h4 id=\"tech-expo-gb200\">GB200</h4><p>SC24 was the debut of NVIDIA's Blackwell GPU in the flesh, and a bunch of integrators had    material on GB200 out at their booths. Interestingly, they all followed the same pattern as GH200 with an NVL4    domain size, and just about every smaller HPC integrator followed a similar pattern where</p><p></p><ul><li>their booth had a standard \"NVIDIA Partner\" (or \"Preferred Partner!\") placard on their main desk</li><li>they had a bare NVIDIA GB200 baseboard (superchip) on display</li><li>there wasn't much other differentiation</li></ul><p>From this, I gather that not many companies have manufactured GB200 nodes yet, or if they    have, there aren't enough GB200 boards available to waste them on display models. So, we had to settle for these    bare NVIDIA-manufactured, 4-GPU + 2-CPU superchip boards:</p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure></figure></div><p></p><p>What struck me is that these are very large FRUs--if a single component (CPU, GPU, voltage    regulator, DRAM chip, or anything else) goes bad, you have to yank and replace four GPUs and two CPUs. And because    all the components are soldered down, someone's going to have to do a lot of work to remanufacture these boards to    avoid throwing out a lot of very expensive, fully functional Blackwell GPUs.</p><p>There were a few companies who were further along their GB200 journey and had more    integrated nodes on display. The HPE Cray booth had this GB200 NVL4 blade (the Cray EX154n) on display:</p><p></p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure></figure></div><p></p><p>It looks remarkably sparse compared to the super-dense blades that normally slot into the    Cray EX line, but even with a single NVL4 node per blade, the Cray EX cabinet only supports 56 of these blades,    leaving 8 blade slots empty in the optimal configuration. 
I assume this is a limitation of power and cooling.</p><p>The booth collateral around this blade suggested its use case is \"machine learning and    sovereign AI\" rather than traditional HPC, and that makes sense since each node has 768 GB of HBM3e which is enough    to support training some pretty large sovereign models. However, the choice to force all I/O traffic on to the    high-speed network by only leaving room for one piddly node-local NVMe drive (this blade only supports one SSD per    blade) will make training on this platform very sensitive to the quality of the global storage subsystem. This is    great if you bundle this blade with all-flash Lustre (like Cray ClusterStor) or DAOS (handy, since <a href=\"https://bsky.app/profile/adrianjhpc.bsky.social/post/3lba4yfg5fc2a\">Intel divested the entire DAOS        development team to HPE</a>). But it's not how I would build an AI-optimized system.</p><p>I suspect the cost-per-FLOP of this Cray GB200 solution is much lower than what a pure-play    GB200 for LLM training would be. And since GB200 is actually a solid platform for FP64 (thanks to Dan Ernst for <a href=\"https://bsky.app/profile/ernstdj.bsky.social/post/3lb23ipwnvc26\">challenging me on this</a> and sharing    some <a href=\"https://arxiv.org/abs/2411.12090\">great resources on the topic</a>), I expect to see this node do well    in situations that are not training frontier LLMs, but rather fine-tuning LLMs, training smaller models, and mixing    in traditional scientific computing on the same general-purpose HPC/AI system.</p><p>Speaking of pure-play LLM training platforms, though, I was glad that very few exhibitors    were trying to talk up GB200 NVL72 this year. 
It may have been the case that vendors simply aren't ready to begin    selling NVL72 yet, but I like to be optimistic and instead believe that the exhibitors who show up to SC24 know that    the scientific computing community likely won't get enough value out of a 72-GPU coherence domain to justify the    additional cost and complexity of NVL72. I didn't see a single vendor with a GB200 NVL36 or NVL72 rack on display    (or a GH200 NVL32, for that matter), and not having to think about NVL72 for the week of SC24 was a nice break from    my day job.</p><p>Perhaps the closest SC24 got to NVL72 was a joint announcement at the beginning of the week    by Dell and CoreWeave, who announced that <a href=\"https://www.coreweave.com/blog/coreweave-pushes-boundaries-with-gb200-and-more\">they have begun bringing        GB200 NVL72 racks online</a>. Dell did have a massive, AI-focused booth on the exhibit floor, and they did talk    up their high-powered, liquid-cooled rack infrastructure. But in addition to supporting GB200 with NVLink Switches,    I'm sure that rack infrastructure would be equally good at supporting nodes geared more squarely at traditional HPC.</p><h4 id=\"tech-expo-ss400\">Slingshot 400</h4><p>HPE Cray also debuted a new 400G Slingshot switch, appropriately named Slingshot 400. I    didn't get a chance to ask anyone any questions about it, but from the marketing material that came out right before    the conference, it sounds like a serdes upgrade without any significant changes to Slingshot's L2 protocol.</p><p>There was a Slingshot 400 switch for the Cray EX rack on display at their booth, and it    looked pretty amazing:</p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure></figure></div><p></p><p>It looks way more dense than the original 200G Rosetta switch, and it introduces    liquid-cooled optics. 
If you look closely, you can also see a ton of flyover cables connecting the switch ASIC in    the center to the transceivers near the top; similar flyover cables are showing up in all manner of    ultra-high-performance networking equipment, likely reflecting the inability to maintain signal integrity across PCB    traces.</p><p>The port density on Slingshot 400 remains the same as it was on 200G Slingshot, so there's    still only 64 ports per switch, and the fabric scale limits don't increase. In addition, the media is saying that    Slingshot 400 (and the GB200 blade that will launch with it) won't start appearing until \"<a href=\"https://www.nextplatform.com/2024/11/26/hpe-upgrades-supercomputer-lineup-top-to-bottom-in-2025/\">Fall        2025</a>.\" Considering 64-port 800G switches (like <a href=\"https://nvidianews.nvidia.com/news/networking-switches-gpu-computing-ai\">NVIDIA's SN5600</a> and <a href=\"https://www.arista.com/en/company/news/press-release/19493-arista-unveils-etherlink-ai-networking-platforms\">Arista's        7060X6</a>) will have already been on the market by then though, Slingshot 400 will be launching with HPE Cray    on its back foot.</p><p>However, there was a curious statement on the placard accompanying this Slingshot 400    switch:</p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure></figure></div><p></p><p>It reads, \"Ultra Ethernet is the future, HPE Slingshot delivers today!\"</p><p>Does this suggest that Slingshot 400 is just a stopgap until 800G Ultra Ethernet NICs begin    appearing? 
If so, I look forward to seeing HPE Cray jam third-party 800G switch ASICs into the Cray EX liquid-cooled    form factor at future SC conferences.</p><h4 id=\"tech-expo-gg\">Grace-Grace for storage?</h4><p>One of the weirder things I saw on the exhibit floor was a scale-out storage server built    on NVIDIA Grace CPUs that the good folks at WEKA had on display at their booth.</p><p></p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure></figure></div><p></p><p>Manufactured by Supermicro, this \"ARS-121L-NE316R\" server (really rolls off the tongue)    uses a two-socket Grace superchip and its LPDDR5X instead of conventional, socketed CPUs and DDR. The rest of it    seems like a normal scale-out storage server, with sixteen E3.S SSD slots in the front and four 400G ConnectX-7 or    BlueField-3 NICs in the back. No fancy dual-controller failover or anything like that; the presumption is that    whatever storage system you'd install over this server would implement its own erasure coding across drives and    servers.</p><p>At a glance, this might seem like a neat idea for a compute-intensive storage system like    WEKA or DAOS. However, one thing that you typically want in a storage server is high reliability and repairability,    features which weren't the optimal design point for these Grace superchips. Specifically,</p><p></p><ul><li>The Grace-Grace superchip turns both CPU sockets into a single FRU. This means that if one CPU goes bad, you're        shipping the whole board back to NVIDIA rather than just doing a field-swap of a socket.</li><li>Grace uses LPDDR5X, whose ECC is not as robust as DDR5's. I'm not an expert on memory architecture, but my        understanding is that the ECC scheme on Grace does not provide ChipKill or protect against row failures. 
And as with CPU        failure, if a single DRAM chip goes bad, you're throwing out two CPUs and all the DRAM.</li><li>There's no way to value-engineer the exact quantity of cores, clock, and DRAM to be optimal for the storage        software installed on top of these servers.</li></ul><p>On the upside, though, there might be a cost advantage to using this Grace-Grace server    over a beefier AMD- or Intel-based server with a bunch of traditional DIMMs. And if you really like NVIDIA products,    this lets you do NVIDIA storage servers to go with your NVIDIA network and NVIDIA compute. As long as your storage    software can work with the interrupt rates of such a server (e.g., it supports rebuild-on-read) and the 144 Neoverse    V2 cores are a good fit for its computational requirements (e.g., calculating complex erasure codes), this server    makes sense. But building a parallel storage system on LPDDR5X still gives me the willies.</p><p>I could also see this thing being useful for certain analytics workloads, especially those    which may be upstream of LLM training. I look forward to hearing about where this turns up in the field.</p><p></p><h4 id=\"tech-expo-hbv5\">Microsoft and AMD's new HBM CPU</h4><p>The last bit of new and exciting HPC technology that I noted came from my very own employer    in the form of HBv5, a new, monster four-socket node featuring custom-designed AMD CPUs with HBM. 
STH wrote up <a href=\"https://www.servethehome.com/this-is-the-microsoft-azure-hbv5-and-amd-mi300c-nvidia/\">an article with        great photos of HBv5 and its speeds and feeds</a>, but in brief, this single node has:</p><p></p><ul><li>384 physical Zen 4 cores (352 accessible from within the VM) that clock up to 4 GHz</li><li>512 GB of HBM3 (up to 450 GB accessible from the VM) with up to 6.9 TB/s STREAM bandwidth</li><li>4x NDR InfiniBand NICs clocked at 200G per port</li><li>200G Azure Boost NIC (160G accessible from the VM)</li><li>8x 1.84 TB NVMe SSDs with up to 50 GB/s read and 30 GB/s write bandwidth</li></ul><p></p><p>The node itself looks kind of wacky as well, because there just isn't a lot on it:</p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure></figure></div><p></p><p>There are the obvious four sockets of AMD EPYC 9V64H, each with 96 physical cores and 128 GB of HBM3, and giant heat    pipes on top of them since it's 100% air-cooled. But there's no DDR at all, no power converter board (the node is    powered by a DC bus bar), and just a few flyover cables to connect the PCIe add-in-card cages. There is a separate    fan board with just two pairs of power cables connecting to the motherboard, and that's really about it.</p><p>The front end of the node shows its I/O capabilities which are similarly uncomplicated:</p><p></p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure></figure></div><p></p><p>There are four NDR InfiniBand cards (one localized to each socket) which are 400G-capable but cabled up at 200G,    eight E1.S NVMe drives, and a brand-new dual-port Azure Boost 200G NIC. 
Here's a close-up of the right third of the    node's front:</p><p></p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure></figure></div><p>This is the first time I've seen an Azure Boost NIC in a server, and it looks much better integrated than the previous-generation 100G Azure SmartNIC that put the FPGA and hard NIC on separate boards connected by a funny little pigtail. This older 100G SmartNIC with pigtail was also on display at the Microsoft booth in an ND MI300X v5 node:</p><p></p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure></figure></div><p></p><p>And finally, although I am no expert in this new node, I did hang around the people who are all week, and I    repeatedly heard them answer the same few questions:</p><p></p><ul><li><b>Is this MI300C?</b> It is if you want it to be. You can call it Sally if you want; I don't think it will        care. But Microsoft calls it HBv5, and the processor name will show up as AMD EPYC 9V64H in /proc/cpuinfo.</li><li><b>Is its InfiniBand 1x800 port, 2x400 ports, ...?</b> There are four NDR InfiniBand HCA cards, and each card        has one full 400G NDR InfiniBand port. However, each port is only connected up to top-of-rack switching at 200G.        Each InfiniBand HCA hangs off of a different EPYC 9V64H socket so that any memory address can get to        InfiniBand without having to traverse Infinity Fabric. 
Running four ports of NDR InfiniBand at half speed is an        unusual configuration, but that's what's going on here.</li><li><b>How can I buy this CPU?</b> EPYC 9V64H are \"<a href=\"https://techcommunity.microsoft.com/blog/azurehighperformancecomputingblog/announcing-azure-hbv5-virtual-machines-a-breakthrough-in-memory-bandwidth-for-hp/4303504\">custom            AMD EPYC processors only available in Azure</a>.\" This means the only way to access it is by provisioning an        HBv5 virtual machine in Azure.</li></ul><div>Amidst all the unrelenting news about new GPUs optimized for AI workloads, it was nice to see something new and    unique launched squarely for the benefit of traditional scientific computing workloads.</div><p></p><p></p><h2 id=\"industry\">The HPC industry overall</h2><div><p>New technology announcements are always exciting, but one of the main reasons I attend        SC and ISC is to figure out the broader trends shaping the HPC industry. What concerns are top of mind for the        community, and what blind spots remain open across all the conversations happening during the week? Answering        these questions requires more than just walking the exhibit floor; it involves interpreting the subtext of the        discussions happening at panels and BOF sessions. However, identifying where the industry needs more information        or a clearer picture informs a lot of the public-facing talks and activities in which I participate throughout        the year.</p></div><h3 id=\"industry-attendee\">What I learned about the average SC technical program attendee</h3><p>The biggest realization that I confirmed this week is that <b>the SC conference is not an HPC        conference; it is a scientific computing conference</b>. 
I sat in a few sessions where the phrase \"HPC    workflows\" was clearly a stand-in for \"scientific workflows,\" and \"performance evaluation\" still really means \"MPI    and OpenMP profiling.\" I found myself listening to ideas or hearing about tools that were <em>intellectually</em>    interesting but ultimately not useful to me because they    were so entrenched in the traditions of applying HPC to scientific computing. Let's talk about a few ways in which    this manifested.</p><h4 id=\"industry-attendee-sustainability\">People think sustainability and energy efficiency are the same thing</h4><p>Take, for example, the topic of sustainability. There were talks, panels, papers, and BOFs    that touched on the environmental impact of HPC throughout the week, but the vast majority of them really weren't    talking about sustainability at all; they were talking about energy efficiency. These talks often use the following    narrative:</p><p></p><ol><li>Energy use from datacenters is predicted to reach some ridiculous number by 2030</li><li>We must create more energy-efficient algorithms, processors, and scheduling policies</li><li>Here is an idea we tested that reduced the energy consumption without impacting the performance of some        application or workflow</li><li>Sustainability achieved! Success!</li></ol><p>The problem with this approach is that it declares victory when energy consumption is    reduced. This is a great result if all you care about is spending less money on electricity for your supercomputer,    but it completely misses the much greater issue that the electricity required to power an HPC job is often generated    by burning fossil fuels, and that the carbon emissions that are directly attributable to HPC workloads are    contributing to global climate change. 
This blind spot was exemplified by this slide, presented during a talk titled    \"Towards Sustainable Post-Exascale Leadership Computing\" at the Sustainable Supercomputing workshop:</p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure></figure></div><p></p><p>I've <a href=\"https://blog.glennklockwood.com/2024/11/fasst-will-be-does-opportunity-to-adapt.html\">written about        this before</a> and I'll write about it again: FLOPS/Watt and PUE are not    meaningful metrics by themselves when talking about sustainability. A PUE of 1.01 is not helpful if the datacenter    that achieves it relies on burning coal for its power. Conversely, a PUE of 1.5 is not bad if all that electricity    comes from a zero-carbon energy source. The biggest issue that I saw being reinforced at SC this year is that    claims of \"sustainable HPC\" are accompanied by the subtext of \"as long as I can keep doing everything else the way I    always have.\"</p><p>There were glimmers of hope, though. Maciej Cytowski from Pawsey presented the opening talk    at the Sustainable Supercomputing workshop, and he led with the right thing--he acknowledged that 60% of    the fuel mix that powers Pawsey's supercomputers comes from burning fossil fuels:</p><p></p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure></figure></div><p></p><p>Rather than patting himself on the back for his low PUE, Dr. Cytowski described how    they built their datacenter atop a large aquifer from which they draw water at 21°C and return it at 30°C to avoid    using energy-intensive chillers. To further reduce the carbon impact of this water loop, Pawsey also installed over    200 kW of solar panels on its facility roof to power the water pumps. 
Given the fact that Pawsey cannot relocate to    somewhere with a higher ratio of zero-carbon energy on account of its need to be physically near the Square    Kilometer Array, Cytowski's talk felt like the most substantive discussion on sustainability in HPC that week.</p><p>Most other talks and panels on the topic really wanted to equate \"sustainability\" to \"FLOPS    per Watt\" and pretend like where one deploys a supercomputer is not a part of the sustainability discussion. The    reality is that, if the HPC industry wanted to take sustainability seriously, it would talk less about watts and    more about tons of CO<sub>2</sub>. Seeing as how the average watt of electricity in Tennessee produces <a href=\"https://www.epa.gov/egrid/data-explorer\">2.75x more carbon</a> than a watt of electricity in Washington,    the actual environmental impact of fine-tuning Slurm scheduling or fiddling with CPU frequencies is meaningless when    compared to the benefits that would be gained by deploying that supercomputer next to a hydroelectric dam instead of    a coal-fired power plant.</p><p>I say all this because there are parts of the HPC industry (namely, the part in which I work)    who <i>are</i> serious about sustainability. And those conversations go beyond simply building supercomputers in    places where energy is low-carbon (thereby reducing <a href=\"https://www.epa.gov/climateleadership/scope-1-and-scope-2-inventory-guidance\">Scope 2 emissions</a>). 
They    include holding suppliers to high standards on reducing the carbon impact of transporting people and material to    these data centers, reducing the carbon impact of all the excess packaging that accompanies components, and being    accountable for the impact of everything in the data center after it reaches end of life (termed <a href=\"https://www.epa.gov/climateleadership/scope-3-inventory-guidance\">Scope 3 emissions</a>).</p><p>The HPC community--or more precisely, the scientific computing community--is still married    to the idea that the location of a supercomputer is non-negotiable, and \"sustainability\" is a nice-to-have secondary    goal. I was    hoping that the sessions I attended on sustainability would approach this topic at a level where the    non-scientific HPC world has been living. Unfortunately, the discussion at SC24, which spanned workshops, BOFs, and    Green 500, remains largely stuck on the idea that PUE and FLOPS/Watt are the end-all sustainability metrics. Those    metrics are important, but there are global optimizations that have much greater effects on reducing the    environmental impact of the HPC industry.</p><h4 id=\"industry-attendee-ai\">AI sessions are really scientific computing sessions about AI</h4><p>Another area where \"HPC\" was revealed to really mean \"scientific computing\" was in the    topic of AI. I sat in on a few BOFs and panels around AI topics to get a feel for where this community is in    adopting AI for science, but again, I found the level of discourse to degrade to generic AI banter despite the best    efforts of panelists and moderators. 
For example, I sat in the \"Foundational Large Language Models for    High-Performance Computing\" BOF session, and Jeff Vetter very clearly defined what a \"foundational large language    model\" was at the outset so we could have a productive discussion about their applicability in HPC (or, really,    scientific computing):</p><p></p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure></figure></div><p></p><p>The panelists did a good job of outlining their positions. On the upside, LLMs are good for    performing source code conversion, documenting and validating code, and maximizing continuity in application codes    that get passed around as graduate students come and go. On the downside, they have a difficult time creating    efficient parallel code, and they struggle to debug parallel code. And that's probably where the BOF should have    stopped, because LLMs, as defined at the outset of the session, don't actually have a ton of applicability in    scientific computing. But as soon as the session opened up to audience questions, the session went off the rails.</p><p>The first question was an extremely basic and nonspecific question: \"Is AI a bubble?\"</p><p>It's fun to ask provocative questions to a panel of experts. I get it. But the question had    nothing to do with LLMs, any of the position statements presented by panelists, or even HPC or scientific computing.    
It turned a BOF on \"LLMs for HPC\" into a BOF that might as well have been titled \"Let's just talk about AI!\" A few    panelists tried to get things back on track by talking about the successes of surrogate models to simulate physical    processes, but this reduced the conversation to a point where \"LLMs\" really meant \"any AI model\" and \"HPC\" really    meant \"scientific simulations.\"</p><p>Perhaps the most productive statement to come out of that panel was when Rio Yokota    asserted that \"we\" (the scientific community) should not train their own LLMs, because doing so would be    \"unproductive for science.\" But I, as well as anyone who understands the difference between LLMs and \"AI,\" already    knew that. And the people who don't understand the difference between an LLM and a surrogate model probably didn't    pick up on Dr. Yokota's statement, so I suspect the meaning of his contribution was completely lost.</p><p>Walking out of that BOF (and, frankly, the other AI-themed BOFs and panels I attended), I    was disappointed at how superficial the conversation was. This isn't to say these AI sessions were objectively    <i>bad</i>; rather, I think it reflects the general state of understanding of AI amongst SC attendees. Or perhaps it    reflects the demographic that is drawn to these sorts of sessions. If the SC community is not ready to have a    meaningful discussion about AI in the context of HPC or scientific computing, attending BOFs with like-minded peers    is probably a good place to begin getting immersed.</p><p>But what became clear to me this past week is that SC BOFs and panels with \"AI\" in their    title aren't really meant for practitioners of AI. 
They're meant for scientific computing people who are beginning    to dabble in AI.</p><h4 id=\"industry-attendee-ops\">AI for operations is not yet real in scientific computing</h4><p>I was invited to sit on a BOF panel called \"Artificial Intelligence and Machine Learning    for HPC Workload Analysis\" following on a successful BOF in which I participated at ISC24. The broad intent was to    have a discussion around the tools, methods, and neat ideas that HPC practitioners have been using to better    understand workloads, and each of us panelists was tasked with talking about a project or idea we had in applying    AI/ML to improve some aspect of workloads.</p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure></figure></div><p></p><p>What emerged from us speakers' lightning talks is that applying AI for operations--in this    case, understanding user workloads--is nascent. Rather than talking about how we use AI to affect how we design or    operate supercomputers, all of us seemed to focus more on how we are collecting data and beginning to analyze that    data using ML techniques. And maybe that's OK, because AI won't ever do anything for workload characterization until    you have a solid grasp of the telemetry you can capture about those workloads in the first place.</p><p>But when we opened the BOF up to discussion with all attendees, despite having a packed    room, the audience had very little to contribute. Our BOF lead, Kadidia Konaté, tried to pull discussion out of the    room from a couple of different fronts by asking what tools people were using, what challenges they were facing, and    things along those lines. However, it seemed to me that the majority of the audience was in that room as spectators;    they didn't know where to start applying AI towards understanding the operations of supercomputers. 
Folks attended    to find out the art of the possible, not talk about their own challenges.</p><p>As such, the conversation wound up bubbling back up to the safety of traditional topics in    scientific computing--how is LDMS working out, how do you deal with data storage challenges of collecting telemetry,    and all the usual things that monitoring and telemetry folks worry about. It's easy to talk about the topics you    understand, and just as the LLM conversation reverted back to generic AI for science and the sustainability topic    reverted back to FLOPS/Watt, this topic of AI for operations reverted back to standard telemetry collection.</p><h4 id=\"industry-attendee-hyperscale\">Some are beginning to realize that HPC exists outside of scientific computing</h4><p>Despite the pervasive belief at SC24 that \"HPC\" and \"scientific computing\" are the same thing, there are early signs    that the leaders in the community are coming to terms with the reality that there is now a significant amount of    leadership HPC happening outside the scope of the conference. This was most prominent at the part of the Top500 BOF    where Erich Strohmaier typically discusses trends based on the latest publication of the list.</p><p>In years past, Dr. Strohmaier's talk was full of statements that strongly implied that, if a supercomputer is not    listed on Top500, it simply does not exist. This year was different though: he acknowledged that El Capitan,    Frontier, and Aurora were \"the three exascale systems <u style=\"font-style: italic;\">we are aware of</u>,\" now being    clear that there is room for exascale systems to exist that simply never ran HPL, or never submitted HPL results to    Top500. 
He explicitly acknowledged again that China has stopped making any Top500 submissions, and although he    didn't name them outright, he spent a few minutes dancing around \"hyperscalers\" who have been deploying exascale    class systems such as <a href=\"https://glennklockwood.com/garden/systems/meta's-h100-clusters\">Meta's H100        clusters</a> (2x24K H100), <a href=\"https://glennklockwood.com/garden/systems/colossus\">xAI's        Colossus</a> (100K H100), and the full system behind <a href=\"https://glennklockwood.com/garden/systems/eagle\">Microsoft's Eagle</a> (14K H100 is a \"<a href=\"https://build.microsoft.com/en-US/sessions/984ca69a-ffca-4729-bf72-72ea0cd8a5db\">tiny fraction</a>\").</p><p>Strohmaier did an interesting analysis that estimated the total power of the Top500 list's supercomputers so he could    compare it to industry buzz around hyperscalers building gigawatt-sized datacenters:</p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure></figure></div><p></p><p>It was a fun analysis where he concluded that there are between 500-600 megawatts of supercomputers on the Top500    list, and after you factor in storage, PUE, and other ancillary power sources, the whole Top500 list sums up to what    hyperscalers are talking about sticking into a single datacenter facility.</p><p>Although he didn't say it outright, I think the implication here is that the Top500 list is rapidly losing relevance    in the broad HPC market, because a significant amount of the world's supercomputing capacity <i>and capability</i>    are absent from the list. Although specific hyperscale supercomputers (like Meta's, xAI's, and Microsoft's) were not    mentioned outright, their absence from the Top500 list suggests that this list might already be more incomplete than    it is complete--the sum of the FLOPS or power on the Top500 supercomputers may be less than the sum of the giant    supercomputers which are known but not listed. 
This will only get worse as the AI giants keep building systems every    year while the government is stuck on its 3-5 year procurement cycles.</p><p>It follows that the meaning of the Top500 is sprinting towards a place where it is not representative of HPC so much    as it is representative of <i>the slice of HPC that serves scientific computing</i>. Erich Strohmaier was clearly    aware of this in his talk this year, and I look forward to seeing how the conversation around the Top500 list    continues to morph as the years go on.</p><h3 id=\"industry-nsf\">NSF's broad front vs. DOE's big bets in HPC and AI</h3><p>My career was started at an NSF HPC center and <a href=\"https://blog.glennklockwood.com/2022/05/life-and-leaving-nersc.html\">built up over my years in the        DOE</a>, so I feel like I owe a debt to the people who provided all the opportunities and mentorship that let me    get to the place of privilege in the hyperscale/AI industry that I now enjoy. As a result, I find myself still    spending a lot of my free time thinking about <a href=\"https://glennklockwood.com/garden/government's-role-in-ai\">the role of governments in the changing face of        HPC</a> (as evidenced by my critiques of <a href=\"https://blog.glennklockwood.com/2024/10/a-critique-of-call-for-public-ai.html\">thinktank reports</a> and <a href=\"https://blog.glennklockwood.com/2024/11/fasst-will-be-does-opportunity-to-adapt.html\">federal RFIs</a>...) and trying to bridge the gap    in technical understanding between my old colleagues (in DOE, NSF, and European HPC organizations) and whatever they    call what I work on now (hyperscale AI?).</p><p>To that end, I found myself doing quite a bit of <i>business development</i> (more on this later) with government    types since I think that is where I can    offer the most impact. I used to be government, and I closely follow the state of their thinking in HPC, but I also    know what's going on inside the hyperscale and AI world. 
I also have enough context in both areas to draw a line    through all the buzzy AI press releases to demonstrate how the momentum of private-sector investment in AI might    affect the way national HPC    efforts do business. So, I did a lot of talking to both my old colleagues in DOE and their industry partners in an    attempt to help them understand how the hyperscale and AI industry thinks about infrastructure, and what they should    expect in the next year.</p><p>More importantly though, I also sat in on a couple of NSF-themed BOFs to get a better understanding of where their    thinking is, where NAIRR is going, how the NSF's strategy contrasts with DOE's strategy, and where the ambitions of    the Office of Advanced Cyberinfrastructure might intersect with the trajectory of hyperscale AI.</p><p>What I learned was that NSF leadership is aware of everything that the community should be concerned about: the    growth of data, the increasing need for specialized silicon, the incursion of AI into scientific computing, new    business models and relationships with industry, and broadening the reach of HPC investments to be globally    competitive. But beyond that, I struggled to see a cohesive vision for the future of NSF-funded    supercomputing. </p><p>A BOF with a broad range of stakeholders probably isn't the best place to lay out a vision for the future of NSF's    HPC efforts, and perhaps NSF's vision is best expressed through its funding opportunities and awards. Whichever the    case may be, it seems like the NSF remains on a path to make incremental progress on a broad front of topics. Its    Advanced Computing Systems and Services (ACSS) program will continue to fund the acquisition of newer    supercomputers, and a smorgasbord of other research programs will continue funding efforts across public access to    open science, cybersecurity, sustainable software, and other areas. 
My biggest concern is that peanut-buttering    funding across such a broad portfolio will make net forward progress much slower than taking big bets. Perhaps big    bets just aren't in the NSF's mission though.</p><p>NAIRR was also a topic that came up in every NSF-themed session I attended, but again, I didn't get a clear picture    of the future. Most of the discussion that I heard was around socializing the resources that are available today    through NAIRR, suggesting that the pilot's biggest issue is not a lack of HPC resources donated by industry, but    awareness that NAIRR is a resource that researchers can use. This was reinforced by a survey whose results were    presented in the NAIRR BOF:</p><p></p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure></figure></div><p></p><p>It seems like the biggest challenge facing the NSF community relying on NAIRR (which has its own sample bias) is    that they don't really know where to start even though they have AI resources (both GPUs and model API services) at    their disposal. In a sense, this is a great position for the NSF since</p><p></p><ol><li>its users need intellectual help more than access to GPU resources, and the NSF has been great at promoting        education, training, and workforce development.</li><li>its users are unlikely to demand the same cutting-edge GPUs that AI industry leaders are snapping up. For        example, the largest pool of GPUs in NAIRR are A100 GPUs that NVIDIA donated via DGX Cloud; the big AI        companies moved off of Ampere a year ago and are about to move off of Hopper.</li></ol><p></p><p>However, it also means that there's not a clear role for partnership with many industry players beyond donating    resources to the NAIRR pilot today in the hopes of selling resources to the full NAIRR tomorrow. 
I asked what OAC    leadership thought about moving beyond such a transactional relationship between NSF and industry at one of the BOFs    I attended, and while the panelists were eager to explore specific answers to that question, I didn't hear any ideas    that would approach some sort of truly equitable partnership where both parties contributed in-kind.</p><p>I also walked away from these NSF sessions struck by how different the NSF HPC community's culture is from that of    the DOE. NSF BOF attendees seemed focused on getting answers and guidance from NSF leadership, unlike the typical    DOE gathering, where discussions often revolve around attendees trying to shape priorities to align with their own    agendas. A room full of DOE people tends to feel like everyone thinks they're the smartest person there, while NSF    gatherings appear more diverse in the expertise and areas of depth of its constituents. Neither way is inherently    better or worse, but it will make the full ambition of NAIRR (as an inter-agency collaboration) challenging to    navigate. This is particularly relevant as DOE is now pursuing its own multi-billion-dollar AI infrastructure    effort, FASST, that appears to sidestep NAIRR.</p><h3 id=\"industry-expo\">Exhibitor trends</h3><p>There's no better way to figure out what's going on in the HPC industry than walking the    exhibit floor each year, because booths cost money and reflect the priorities (and budgets) of all participants.    This year's exhibit felt physically huge, and walking from one end to the other was an adventure. 
You can get a    sense of the scale from this photo I took during the opening gala:</p><p></p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure></figure></div><p></p><p>Despite having almost 18,000 registrants and the opening gala usually being a crush of people, the gala this year felt and looked very sparse just because people and booths were more spread out. There was also a perceptibly larger number of splashy vendors who had never attended before and who were promoting downstream HPC technologies like data center cooling and electrical distribution, and there was healthy speculation online about whether the hugeness of the exhibit this year was due to these new power and cooling companies.</p><p></p><p>To put these questions to rest, I figured out how to yank down all the exhibitor metadata    from the conference website so I could do some basic analysis on it.</p><h4 id=\"industry-expo-booths\">Booths by the numbers</h4><p>The easiest way to find the biggest companies to appear this year was to compare the    exhibitor list and booth sizes from SC23 to this year and see whose booth went from zero to some big square footage.</p><p></p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure></figure></div><p></p><p>I only took the top twenty new vendors, but they broadly fall into a couple of categories:</p><p></p><ul><li><b>Power and cooling</b>: Stulz, Delta, Airedale, Valvoline, Boundary Electric, Schneider Electric, Mara        </li><li><b>Server manufacturing</b>: Wistron, AMI, Pegatron</li><li><b>Higher ed</b>: Tennessee Tech, SCRCC</li></ul><p>There were a couple other companies that must've just missed last SC but aren't new to        the show (NetApp, Ansys, Samsung, Micron, Broadcom). 
And curiously, only one new GPU-as-a-Service provider        (Nebius) showed up this year, suggesting last year was the year of the GPU Cloud.</p><p>But to confirm what others had speculated: yes, a significant amount of the new square        footage of the exhibit floor can be attributed to companies focused on power and cooling. This is an interesting        indicator that HPC is becoming mainstream, largely thanks to AI demanding ultra-high density of power and        cooling. But it's also heartening to see a few new exhibitors in higher education making an appearance. Notably,        SCRCC (South Carolina Research Computing Consortium) is a consortium of Clemson, University of South        Carolina, and Savannah River National Laboratory that just formed last year, and I look forward to seeing what        their combined forces can bring to bear.</p><p>We can also take a look at whose booths grew the most compared to SC23:</p><p></p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure></figure></div><p></p><p>This distribution is much more interesting, since the top 20 exhibitors who grew their footprint comprise the        majority of the growth in existing exhibitors. Cherry-picking a few interesting growers:</p><p></p><ul><li><b>Power and cooling</b>: USystems, Midas, Vertiv</li><li><b>Data center/GPUaaS</b>: iM, Iris Energy, and (arguably) Oracle</li><li><b>Software</b>: Arc Compute and CIQ</li><li><b>Companies facing serious financial or legal troubles</b>: I count at least three! Impressive that they            are still pouring money into their SC booths.</li></ul><p>It's also interesting to see HLRS, the German national HPC center, grow so        significantly. I'm not sure what prompted such a great expansion, but I take it to mean that things have been        going well there.</p><p>Finally, Dell had a massive booth and showing this year. 
Not only did they grow the        most since SC23, but they had the single largest booth on the exhibit floor at SC24. This was no doubt a result        of their great successes in partnering with NVIDIA to land massive GPU buildout deals at places like <a href=\"https://qz.com/dell-super-micro-computer-stock-elon-musk-ai-nvidia-1851550428\">xAI</a> and <a href=\"https://www.tomshardware.com/tech-industry/artificial-intelligence/dell-reaches-milestone-with-industrys-first-enterprise-ready-nvidia-blackwell-poweredge-xe9712-server-racks\">CoreWeave</a>.        They also had \"AI factory\" messaging emblazoned all over their marketing material and debuted a nice 200 kW        liquid-cooled rack that will be the basis for their GB200 NVL72 solution, clearly leaning into the idea that        they are leaders in AI infrastructure. Despite this messaging being off-beat for the SC audience as I've        described earlier, their booth was surprisingly full all the time, and I didn't actually get a chance to get in        there to talk to anyone about what they've been doing.</p><p>Equally interesting are the vendors who reduced their footprint at SC24 relative to        SC23:</p><p></p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure></figure></div><p></p><p>Reading too much into any of these big shrinkers is pretty easy; while a reduction in        booth size could suggest business hasn't been as good, it could equally mean that an exhibitor just went        overboard at SC23 and downsized to correct this year. A few noteworthy exhibitors to call out:</p><p></p><ul><li>Penguin and the Korea Semiconductor Industry Association both cut way back from massive 50x50 booths to            30x30. Their booths this year were both big, but they weren't massive. 
Viridien, formerly known as CGG, also            shrank from a massive booth to a less-massive 30x40.</li><li>Juniper still kept an independent booth, but it is in the <a href=\"https://www.hpe.com/us/en/newsroom/press-release/2024/01/hpe-to-acquire-juniper-networks-to-accelerate-ai-driven-innovation.html\">process                of being absorbed into HPE</a>. Shrinking makes sense.</li><li>Major cloud providers Google and AWS scaled back, but Microsoft did not.</li><li>GPU-as-a-Service cloud providers CoreWeave and Lambda both scaled back. Since these GPUaaS providers'            business models typically rely on courting few big customers, it may make sense to cut back on booth volume.        </li><li>Major AI storage companies DDN, VAST, and (to a lesser degree) Pure also scaled back, while WEKA did not. I            know business for DDN and VAST has been great this past year, so these may just reflect having gone            overboard last year.</li></ul><p>Overall, almost twice as many vendors grew their booths as scaled back, so I'd        caution anyone against trying to interpret any of this as anything beyond exhibitors right-sizing their booths        after going all-in last year.</p><p>Finally, there are a handful of vendors who disappeared outright after SC23:</p><p></p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure></figure></div><p></p><p>It is critical to point out that the largest booths to vanish outright were all on the        smaller side: SUSE, Tenstorrent, and Symbiosys Alliance all disappeared this year, but their booths last year        were only 20x30. 
I was surprised to see that Tenstorrent and Arm didn't have booths, but the others are either        companies I haven't heard of (suggesting the return on investment of showing at SC might've been low), are easy        to rationalize as only being HPC-adjacent (such as SNIA and DigitalOcean), or simply went bankrupt in the last        year.</p><p>As we say at the business factory, the net-net of the exhibit hall this year is that        the square footage of booth space increased by 15,000 square feet, so it was in fact bigger, it did take longer        to walk from one end to the other, and there definitely were a bunch of new power and cooling companies filling        out the space. Some exhibitors shrank or vanished, but the industry as a whole appears to be moving in a healthy        direction.</p><p>And if you're interested in analyzing this data more yourself, please have a look at <a href=\"https://github.com/glennklockwood/sc-exhibitors\">the data and the Jupyter notebook I used to generate            the above treemaps on GitHub</a>. If you discover anything interesting, please write about it and post it        online!</p><p></p><p></p><h4 id=\"industry-expo-gpuaas\">Proliferation of GPU-as-a-Service providers</h4><p>As an AI infrastructure person working for a major cloud provider, I kept an eye out for all the companies trying        to get into the GPU-as-a-Service game. <a href=\"https://blog.glennklockwood.com/2023/11/sc23-recap.html\">I described these players last year as            \"pure-play GPU clouds,\"</a> and it seems like the number of options available to customers who want to go        this route is growing. But I found it telling that a lot of them had booths that were completely        indistinguishable from each other. 
Here's an example of one:</p><p></p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure></figure></div><p></p><p>As best I can tell, these companies are all NVIDIA preferred partners with    data centers and a willingness to deploy NVIDIA GPUs, NVIDIA SmartNICs, and NVIDIA cloud stack, and sell multi-year    commitments to consume those GPUs. I tried to accost some of these companies' booth staff to ask them my favorite    question (\"What makes you different from everyone else?\"), but most of these companies' booths were staffed by    people more interested in talking to each other than me.</p><p>These GPUaaS providers tend to freak me out, because, as Microsoft's CEO recently stated, these companies are        often \"<a href=\"https://www.microsoft.com/en-us/Investor/events/FY-2025/earnings-fy-2025-q1\">just a bunch of            tech companies still using VC money to buy a bunch of GPUs</a>.\" I can't help but feel like this is where        the AI hype will come back to bite companies who have chosen to build houses upon sand. Walking the SC24 exhibit        floor is admittedly a very narrow view of this line of business, but it seemed like some of these companies were        content to buy up huge booths, hang a pretty banner above it, and otherwise leave the booth empty of anything        beyond a few chairs and some generic value propositions. I didn't feel a lot of hunger or enthusiasm from these        companies despite the fact that a bunch of them have hundreds of millions of dollars of GPUs effectively sitting        on credit cards that they are going to have to make payments on for the next five years.</p><p>That all said, not all the companies in the GPUaaS market are kicking back and letting the money pour in. In particular,        I spent a few minutes chatting up someone at the CoreWeave booth, and I was surprised to hear about how much        innovation they're adding on top of their conventional GPUaaS offering. 
For example, they developed <a href=\"https://docs.coreweave.com/coreweave-machine-learning-and-ai/training/sunk\">Slurm on Kubernetes            (SUNK)</a> with one of their key customers to close the gap between the fact that CoreWeave exposes its GPU        service through Kubernetes, but many AI customers have built their stack around Slurm, <a href=\"https://github.com/NVIDIA/pyxis\">pyxis</a>, and <a href=\"https://github.com/NVIDIA/enroot\">enroot</a>.    </p><p>In a weird twist of fate, I later ran into an old acquaintance who turned out to be one of the key CoreWeave        customers for whom SUNK was developed. He commented that SUNK is the real deal and does exactly what his users        need which, given the high standards that this person has historically had, is a strong affirmation that SUNK is        more than just toy software that was developed and thrown on to GitHub for an easy press release. CoreWeave is        also developing some interesting high-performance object storage caching software, and all of these software        services are provided at no cost above whatever customers are already paying for their GPU service.</p><p>I bring this up because it highlights an emerging distinction in the GPUaaS market, which used to be a homogenous        sea of bitcoin-turned-AI providers. Of course, many companies still rely on that simple business model: holding        the bill for rapidly depreciating GPUs that NVIDIA sells and AI startups consume. However, there are now GPUaaS        providers moving up the value chain by taking on the automation and engineering challenges that model developers        don't want to deal with. Investing in uncertain projects like new software or diverse technology stacks is        certainly risky, especially since they may never result in enough revenue to pay for themselves. But having a        strong point of view, taking a stance, and investing in projects that you feel are right deserves recognition.        
My hat is off to the GPUaaS providers who are willing to take these risks and raise the tide for all of us        rather than simply sling NVIDIA GPUs to anyone with a bag of money.</p><h2 id=\"community\">Community and connections</h2><p>As much as I enjoy <i>increasing shareholder value</i>, the part of SC that gives me the    greatest joy is reconnecting with the HPC community. Knowing I'll get to chat with my favorite people in the    industry (and meet some new favorite people!) makes the long plane rides, upper respiratory infections, and weird    hotel rooms completely worth it.</p><p></p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure></figure></div><p></p><p>I wound up averaging under six hours of sleep per night this year in large part because 9pm    or 7am were often the only free times I had to meet with people I really wanted to see. I have this unhealthy    mindset where every hour of every day, from the day I land to the day I leave, is too precious to waste, and it's    far too easy for me to rationalize that spending an hour talking to someone interesting is worth losing an hour of    sleep.</p><p>But like I said at the outset of this blog post, this year felt different for a few    reasons, and a lot of them revolve around the fact that I think I'm getting old. Now, it's always fun to say \"I'm    getting old\" in a mostly braggadocious way, but this feeling manifested in concrete ways that affected the way I    experienced the conference:</p><p></p><ol><li>I hit my limit on Monday night and couldn't get home without spending 15 minutes sitting in an unlit playground        across from the World of Coke. I've always gotten blisters and fatigue, but this was the first time I couldn't        just cowboy up and muscle through it. 
To avoid a repeat of this, I wound up \"wasting\" (see above) a lot more        time to just get off my feet this year.</li><li>This year, I reached the point where I need to start time-boxing how much time I spend chatting up the folks I        bump into. I used to just let the good times roll if I ran into someone I knew, but this year I wound up        spending as much time attending sessions as I did missing sessions because I got caught up in a conversation.        This isn't a bad thing per se, but I did feel a little sour when I realized I'd made a bad bet on choosing to        chat instead of attending a session or vice versa, and this bad feeling lingered in the back of my mind just        about every day.</li><li>There weren't a lot of surprises for me at the conference this year, and I worry that I am at risk of losing        touch with the technical aspects of the conference that get newer attendees excited. Instead of hearing about,        say, the latest research in interconnects, more of my time was spent mucking it up with the sorts of people in        the HPC community who I used to find intimidating. On the one hand, hooray me for making it into old boys'        clubs. But on the other, I don't want to become some HPC greybeard whose last meaningful contribution to the        industry was twenty years ago.</li><li>This is the first year where I've had people accost me <i>and ask me for advice</i>. I've long been accosted by        strangers because of my online presence, but those interactions were always lighthearted exchanges of \"I follow        you on Twitter\" and \"Great to meet you. Have an @HPC_Guru pin.\" This year, I had people specifically ask me for        advice on industry versus postdoc, AI versus HPC, and what my master plan was when I left NERSC. 
Even though I        didn't have any sage advice, I still found it really hard to tell bright-eyed students to go kick rocks just so        I wouldn't be late for yet another mushy panel on AI.</li></ol><p>If you read this all and think \"boo hoo, poor Glenn is too popular and wise for his own    good,\" yeah, I get it. There are worse problems to have. But this was the first year where I felt like what I put    into the conference was greater than what I got out of it. Presenting at SC used to be at least as good for my    career as it was useful for my audiences, but it just doesn't count for much given my current role and career stage.    It felt like some of the magic was gone this year in a way I've never experienced before. </p><p></p><h3 id=\"community-people\">Getting to know people</h3><p>As the years have gone on, I spend an increasing amount of my week having one-on-one    conversations instead of wandering aimlessly. This year though, I came to SC without really having anything to buy    or sell:</p><p></p><ul><li>I am not a researcher, so I don't need to pump up the work I'm doing to impress my fellow researchers.</li><li>I no longer own a product market segment, so I don't directly influence the customers or vendors with whom my        employer works.</li><li>I don't have any bandwidth in my day job to support any new customers or partnerships, so I don't have a strong        reason to sell people on partnering with me or my employer. </li></ul><p>Much to my surprise though, a bunch of my old vendor/partner colleagues still wanted to get    together to chat this year. Reflecting back, I was surprised to realize that it was these conversations--not the    ones about business--that were the most fulfilling this year.</p><p>I learned about people's hobbies, families, and their philosophies on life, and it was    amazing to get to know some of the people behind the companies with whom I've long dealt. 
I was reminded that the    person is rarely the same as the company, and even behind some of the most aggressive and blusterous tech companies    are often normal people with the same concerns and moments of self-doubt that everyone else has. I was also reminded    that good engineers appreciate good engineering regardless of whether it's coming from a competitor or not. The    public persona of a tech exec may not openly admire a competitor's product, but that doesn't mean they don't know    good work when they see it.</p><p>I also surprised a colleague whose career has been in the DOE labs with an anecdote that    amounted to the following: even though two companies may be in fierce competition, the people who work for them    don't have to be. The HPC community is small enough that almost everyone has got a pal at a competing company, and    when there are deals to be made, people looove to gossip. If one salesperson hears a juicy rumor about a prospective    customer, odds are that everyone else on the market will hear about it pretty quickly too. Of course, the boundaries    of confidentiality and professionalism are respected when it matters, but the interpersonal relationships that are    formed between coworkers and friends don't suddenly disappear when people change jobs.</p><p>And so, I guess it would make sense that people still want to talk to me even though I have    nothing to buy or sell. I love trading gossip just as much as everyone else, and I really enjoyed this aspect of the    week.</p><p></p><h3 id=\"community-career\">Talking to early career people</h3><p>I also spent an atypically significant amount of my week talking to early career people in    HPC who knew of me one way or another and wanted career advice. 
This is the first year I recall having the same    career conversations with multiple people, and this new phase of my life was perhaps most apparent during the IEEE    TCHPC/TCPP HPCSC career panel in which I was invited to speak this year.</p><p></p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure></figure></div><p></p><p>It was an honor to be asked to present on a career panel, but I didn't feel very qualified to give career advice to    up-and-coming computer science graduate students who want to pursue HPC. I am neither a computer scientist nor a    researcher, but fortunately for me, my distinguished co-panelists (Drs. Dewi Yokelson, Olga Pearce, YJ Ji, and    Rabab Alomairy) had plenty of more relevant wisdom to share. And at the end of the panel, there were a few things we    all seemed to agree on as good advice:</p><p></p><ol><li>Knowing stuff is good, but being able to learn things is better. Being eager to learn and naturally curious        makes this much easier as well.</li><li>The life of a researcher sometimes requires more than working a standard nine-to-five, so it'll be hard to be        really successful if your heart isn't in it.</li><li><a href=\"https://quoteinvestigator.com/2014/04/06/they-feel/\">People will forget what you did or what you said,            but they remember how you made them feel</a>. Don't be a jerk, because this community is small.</li></ol><p></p><p>In both this panel and the one-on-one conversations I had with early career individuals, the best I could offer was the    truth: I never had a master plan that got me to where I am; I just try out new things until I realize I don't like    doing them anymore. I never knew what I wanted to be when I grew up, and I still don't really, so it now makes me    nervous that people have started approaching me with the assumption that I've got it all figured out. 
Unless I    torpedo my career and go live on a goat farm though, maybe I should prepare for this to be a significant part of my    SC experiences going forward.</p><h3 id=\"community-bsky\">Shift in social media</h3><p>One last, big change in the community aspect of SC this year was the mass-migration of a ton of HPC folks from    Twitter to Bluesky during the week prior to the conference. I don't really understand what prompted it so suddenly;    a few of us have been trying for years to get some kind of momentum on other social platforms like Mastodon, but the    general lack of engagement meant that all the excitement around SC always wound up exclusively on Twitter. This year    was different though, and Bluesky hit critical mass with the HPC community.</p><p>I personally have never experienced an SC conference without Twitter; my first SC was in 2013, and part of what made    that first conference so exciting was being able to pull up my phone and see what other people were seeing,    thinking, and doing across the entire convention center via Twitter. Having the social media component to the    conference made me feel like I was a part of something that first year, and as the years went on, Twitter became an    increasingly indispensable part of the complete SC experience for me.</p><p>This year, though, I decided to <a href=\"https://x.com/glennklockwood/status/1857571101028790498\">try an        experiment</a> and see what SC would be like if I set Twitter aside and invested my time into Bluesky instead.</p><p>The verdict? <i>It was actually pretty nice.</i></p><p>It felt a lot like the SC13 days, where my day ended and began with me popping open Bluesky to see what new <a href=\"https://bsky.app/hashtag/sc24\">#SC24</a> posts were made. 
And because many of the tech companies and HPC    centers hadn't yet made it over, the hashtag wasn't clogged up by a bunch of prescheduled marketing blasts that    buried the posts written by regular old conference attendees who were <a href=\"https://bsky.app/profile/walkingrandomly.bsky.social/post/3lbazofprgc2y\">asking important questions</a>:</p><blockquote class=\"bluesky-embed\"><p lang=\"en\">Which booths at #sc24 have coffee? I noticed oracle do. Anyone else?</p>— Mike Croucher (<a href=\"https://bsky.app/profile/did:plc:sd6xejkhcmyehbscxb5lz3uq?ref_src=embed\">@walkingrandomly.bsky.social</a>) <a href=\"https://bsky.app/profile/did:plc:sd6xejkhcmyehbscxb5lz3uq/post/3lbazofprgc2y?ref_src=embed\">November 18, 2024 at 3:02 PM</a></blockquote><p>Of course, I still clogged Bluesky up with my nonsense during the week, but there was an amazing amount of    engagement by a diversity of thoughtful people--many who came from Twitter, but some whose names and handles I    didn't recognize.</p><p>The volume of traffic on Bluesky during the week did feel a little lower than what it had been on Twitter in years    past though. I also didn't see as many live posts of technical sessions as they happened, so I couldn't really tell    whether I was missing something interesting in real time. This may have contributed to why I felt a little less    connected to the pulse of the conference this year than I had in the past. It also could've been the fact that    the conference was physically smeared out across a massive space though; the sparsity of the convention center was at    least on par with the sparsity on Bluesky.</p><p>At the end of the week, I didn't regret the experiment. In fact, I'll probably be putting more effort into my Bluesky    account than my Twitter account going forward. To be clear though, this isn't a particularly political decision on    my part, and I pass no judgment on anyone who wants to use one platform over the other. 
It's just that I like the    way I feel when I scroll through my Bluesky feeds, and I don't get that same feeling when I use Twitter.</p><h2 id=\"conclusion\">So what's the takeaway?</h2><p>SC this year was a great conference by almost every measure, as it always is, but it still felt a little different for me. I'm sure that some of that feeling is the result of my own growth, and my role with respect to the conference seems to be evolving from someone who gets a lot out of the conference to someone who is giving more to the conference. That's not to say that I don't get a lot out of it, though; I had no shortage of wonderful interactions with everyone from technology executives to rising stars who are early in their career, and I learned a lot about both them and me as whole people. But SC24, more than any SC before it, is when I realized this change was happening.</p><p>On the technological front, we saw the debut of a new #1 system (emblazoned with the smiling face of Bronis...) and a growing crop of massive, new clusters deployed for commercial applications. The exhibit floor was quantitatively bigger, in large part due to new power and cooling companies who are suddenly relevant to the HPC world thanks to the momentum of AI. At the same time, the SC technical program is clearly separating itself out as a conference focused on scientific computing; the level of discourse around AI remains largely superficial compared to true AI conferences, and the role of hyperscalers in the HPC industry is still cast more as a threat than an opportunity.</p><p>For my part, I'm still trying to get a grasp on where government agencies like DOE and NSF want to take their AI ambitions so I can try to help build a better mutual understanding between the scientific computing community and the hyperscale AI community. 
However, it seems like the NSF is progressing slowly on a wide front, while the DOE is doing what DOE does and charging headfirst into a landscape that has changed more than I think they realize.</p><p>There's a lot of technical content that I know I missed on account of the increasing time I've been spending on the people and community aspect of the conference, and I'm coming to terms with the idea that this just may be the way SC is from now on. And I think I'm okay with that, since the support of the community is what helped me go from being a bored materials science student into someone whose HPC career advice is worth soliciting in the short span of eleven years. Despite any or all of the cynicism that may come out in the things I say about this conference, SC is always the highlight of my year. I always go into it with excitement, gladly burn the candle at both ends all week, and fly home feeling both grateful for and humbled by everything the HPC community has done and continues to do to keep getting me out of bed in the morning.</p><p></p>",
            "url": "https://hpc.social/personal-blog/2024/sc-24-recap/",
            
            
            
            
            
            "date_published": "2024-12-02T07:30:00-07:00",
            "date_modified": "2024-12-02T07:30:00-07:00",
            
                "author": "Glenn K. Lockwood's Blog"
            
        },
    
        {
            "id": "https://hpc.social/personal-blog/2024/surfing-the-singularity-the-workflow-is-the-app/",
            "title": "Surfing the Singularity - \"the Workflow is the App\"",
            "summary": null,
            "content_text": "Hello and happy fall holidays to you and yours. As I wrote about in the last blog post [1], as quantum computing hardware matures over the next 5 to 10 years from an experimental toy through to utility and then perhaps advantage over classical (for some applications), it will be included into an already diverse and hybrid computing and applications landscape - on-prem computing, mobile and edge, cloud, and now novel types of computing devices which require new thinking and wholly new means of addressing them. How to deal with the burgeoning heterogeneity of the computing landscape - how to write and run apps which produce and consume data across a widening array of devices- is the topic of this post. Language Landscape The Java programming language, once touted in the glory days of \"the World Wide Web\" as being \"write once, deploy anywhere\", and in its heyday representing 25% of new application development, is now down below 10%. What's hot? Python (23%), and \"the C's\", a collection of C, C++, C# and their kin (&gt;24% in total) which are traditionally recompiled for specific hardware platforms. [2] And while Python provides portability, often for performance in math operations it depends on native libraries, built in, you guessed it, the C's. Into this mix wades the US government which has come out recently with a potentially disruptive statement against the use of the C's, citing security concerns due to their free-wheeling memory management, and in spite of efforts like Safe C++, the government is recommending movement to memory safe languages like Rust, currently with just 1% market share, but \"with a bullet\". 
[3] Whether it is better to port to Rust or just update to Safe C++ depends on many factors - for example, how good are your docs and test cases - and while there may exist conceptual impedance mismatches between languages, modern AI coding assistants will only increase in capability especially for more rote tasks like porting. Add to this mix the coding of Graphics Processing Units (GPUs) - originally intended for visualizations but now used in applications for almost anything involving matrix math (turns out, lots of stuff). GPUs today are mostly sold by NVIDIA and are programmed in the C's (sometimes with a Python interface) using the NVIDIA CUDA library. These pieces of the application, the \"kernels\", are hardware dependent, and while many attempts have been made to create hardware-portable frameworks for GPU programming (see SYCL for example [4]), nearly always the newest fastest GPU features are available in the native non-portable form first, leading to vendor lock-in. (This might be a good time to remember that NVIDIA does not themselves manufacture chips - they design chips which others produce.) The manner in which we program GPUs is similar to the way we program quantum computers, i.e. QPUs - we delegate to them the portions of the application to which they are best suited, program them using device-specific instructions, and weave them back into the holistic solution. Rather than wielding the Java hammer where everything is a virtualized nail, we use the best tool for the job. In quantum computing, for example, \"variational\" hybrid algorithms are a common theme, where some part of the work and preparation are performed on classical hardware as a setup for a quantum step, and then post-processing the results back on classical hardware for potential iteration to an optimal solution. Two of several emerging patterns for integrating quantum computing into an application solution. 
[5]This pattern is analogous to what is also common in classical high performance computing (HPC) for applications like weather modeling and other complex simulations - pre-process on commodity hardware, run an HPC job on the big box, and post-process the results. The introduction into the mix of steerage provided by AI models increases the heterogeneity of the complete solution. A blended computing landscape, enabling for example, quantum computing to produce highly precise data to train AI to steer a classical HPC simulation. [6]All these hardware-dependent application pieces for an ever widening array of hardware means that compilers are cool again, and compiler pipelines like LLVM are critical to application development and deployment. [7] Included in this class of development tools are circuit transpilers for quantum hardware which must take into consideration not only the architectural differences between QPUs (e.g. which gates are supported, what's the inter-qubit connectivity like, etc.), but also the changes which can occur in a quantum data center on a daily basis as these new, noisy, and fragile qubits simply fail and go offline, potentially altering the machine's topology. Just-in-time compilation is needed, and compiler optimization is therefore also cool again. Thank you, Frances Allen. [8] Parts is PartsWhat emerges from this landscape is not a singular executable running on one computer, but rather, multiple application piece parts, written in different languages, running on radically different hardware in sequence and simultaneously, being orchestrated into a complete solution.In other words, a workflow. Back in the day Java's Sun Microsystems (remember them?) asserted \"the network is the computer\". Now we assert \"the workflow is the app\". Or more likely, a workflow of workflows. We like to think of these nested workflows in three types: [9]in-situ: the workflow is running all on the same machine (e.g. 
a local process, an HPC job)intra-site: the workflow is running on different machines within the same connected enterprise (e.g. within the same data center, virtual network, etc.)inter-site: the workflow is running across different machines in different enterprises (e.g. hybrid on-prem and perhaps multi-vendor cloud)With all these compute types, languages, and locations working together to realize the workflow and solution, loose coupling is key - components connected but not dependent - each part minding its own business. In other words, to paraphrase the poet, good interfaces make good neighbors. [10]We use the convenience term \"Site\" to mean a provider of secure compute and data services. What interfaces must a Site provide? The interface or API can include lots of things, but it must at least provide: 1) authentication and authorization, 2) a means to run components through their lifecycle, 3) a means to manage data being operated on and produced, perhaps being moved into and out of the Site, and 4) some way to get an inventory of the Site's service offerings and provision them for the purposes of running components or holding data. We call these by four functional nicknames: Auth, Run, Repo, and Spin. Four functional pillars of an interoperable computing site.We can see in each of the three types of workflows the need for each of these four functional pillars, albeit some as a no-op or inherited from a higher order workflow. For example, in a \"type 1\" workflow of components running on a single machine or within an HPC allocation the Auth aspects may be implied to be already addressed - i.e. the user is already logged into the machine or authorized to run on the HPC cluster. But a workflow which utilizes compute resources both on-prem and in the cloud will have to interact at runtime with the \"auth\" aspects of the cloud provider prior to being able to \"run\" workloads, or put and get data to various \"repos\". 
Most cloud providers provide a means to list available computing resources, to \"spin\" them up and down. This provisioning itself can be part of an end-to-end workflow: authenticate, get an inventory of available services, spin some up, run jobs on them storing the results, and spin them down. Stuck in the MiddleMost cloud providers - from Amazon to IBM Quantum cloud - provide a callable API interface which can be viewed through the lens of Auth, Run, Repo, Spin. So do some of the supercomputers and cutting edge resources provided by the Federal government, most notably those provided by the National Energy Research Scientific Computing Center (NERSC). [11] As Sites, these providers expose their offerings to internal and external workflows, however, they do not themselves promote a means to author these cross-site workflows, to manage them, track them, or keep tabs on all that distributed data. What else is needed? First, since cloud and other service providers have no motivation to standardize their interfaces, a framework super-interface could exist with the ability to plug in drivers for specific service providers. This in theory is the Auth, Run, Repo, Spin interface. Second, since each provider defines their own service and runtime component lifecycle (loosely: start, run, and stop with success or fail end states) there needs to be a way to normalize the status terminology - a \"fail\" on one site is the same as an \"error\" on another, \"success\" means the same thing as \"done\". This permits the third aspect of a middleware framework - the ability to track running jobs on Sites and trigger other jobs on any Site to run accordingly - i.e. the control flow of the workflow. What about the data? Commonly we need the ability to put data to a Site and get some back - this is the Repo interface of the Site. 
And while most (but not all) Sites provide some means to store and retrieve data, be it filesystem or S3 object store or database or something else, it would also be nice to be able to say something \"meta\" about the data - which Site did it come from, what job or application produced it, what other workflow steps on this Site or others consumed it? Some Sites provide storage with metadata (e.g. Amazon S3) but most don't. This metadata comprises the provenance of the data - like a Civil War sword on the Antiques Roadshow, it's the paper trail showing where the item came from, proving the item is legit. In a workflow which perhaps produces many pieces of data, perhaps iteratively as it converges on a solution - keeping track of all the data pieces seems, well, important. The acronym FAIR - findable, accessible, interoperable, reusable - seems a good starting point. [12]Open Says MeOur open source project lwfm, the \"local workflow manager\", attempts to render these concepts as a reference implementation. [13] It's small with minimal Python lib dependencies and can be taken anywhere easily as a single runnable component, its provenance metadata also easily portable and importable. A typical Site driver - a Python class which implements the Site interface - weighs in around 200 lines of code including the whitespace. Armed with a Site driver for a cloud service, you can author long-running workflows which utilize a mix of compute resources, storage, and data infrastructures, and automatically track the provenance paper trail. The lwfm middleware component provides some very recognizable services: polling of remote job status; status normalization and persistence; system and user metadata persistence; event handling, control flow and data flow triggered. Should you use this tooling? I wouldn't recommend it. (Huh? Did I hear you correctly?) How many people are working maintaining it? (Two?) What about the community? (Next to none.) 
The software would fare poorly on a \"spider web\" analysis of its overall quality - you would not want to recommend it to your boss. A convenient multi-axis assessment framework for software model maturity. [14] The lwfm is a reference implementation of a workflow interop framework, at best. Are there alternatives? OMG are there alternatives! The workflow landscape is notoriously rich, fragmented, and super-niched. But portability and interoperability are often neglected as is data provenance. Government or university projects, while well meaning and sometimes directionally correct, quickly go stale when the funding lapses [15], and commercial solutions while often suffering some of the same deficiencies offer the added trap of vendor lock-in and can come with a hefty price tag.Order, OrderSo it's back to committee. [16] Next week the high performance computing community will be meeting again at the SC Conference Series Supercomputing 2024, this year in Atlanta. Hybrid workflows for scientific and engineering applications - involving classical HPC, AI-focused clusters, and now also quantum computers - will be among the very many topics discussed. [17] And we should expect some surprises - in the new rankings for example of top machines on the planet, at least, the ones they want us to know about. [18] Perhaps I'll report back on some of those returns in a future blog. Best regards. - andy References & Amusements [0] Banner photo by Ben Wicks on Unsplash [1] \"Surfing the Singularity: The Universe Computes\", A. 
Gallo, https://www.linkedin.com/pulse/surfing-singularity-universe-computes-andy-gallo-6fgle [2] TIOBE ranking of programming language popularity: https://www.tiobe.com/tiobe-index/ [3] Safe C++, with some chronology of the government statements: https://safecpp.org/ [4] SYCL: https://www.khronos.org/sycl/ [5] \"Post-variational quantum neural networks\", https://pennylane.ai/qml/demos/tutorial_post-variational_quantum_neural_networks [6] \"Hope Versus Hype: Quantum, AI and the Path to Commercial Advantage\", Matthias Troyer, presentation at IEEE Quantum Week, Montreal, September 2024. [7] LLVM: https://llvm.org/ [8] https://amturing.acm.org/award_winners/allen_1012327.cfm [9] \"Industrial Experience Deploying Heterogeneous Platforms for Use in Multi-Modal Power Systems Design Workflows\", A. Gallo et al, https://drive.google.com/file/d/1c3YEVmEAUjbI5urj4PiV2TtjzBUzLlws [10] \"Mending Wall\", Robert Frost, https://www.poetryfoundation.org/poems/44266/mending-wall [11] NERSC SuperFacility API: https://docs.nersc.gov/services/sfapi/ [12] \"The FAIR Guiding Principles for scientific data management and stewardship\", Mark D. Wilkinson et al., https://pmc.ncbi.nlm.nih.gov/articles/PMC4792175/pdf/sdata201618.pdf [13] lwfm, https://github.com/lwfm-proj/lwfm [14] \"Model Maturity Web\", https://richardarthur.medium.com/co-design-web-6f37664ac1e1 [15] Them's fighting words, and I expect to be roasted for it. But it seems to me that even the most popular software tool kits (no names) which emerged from the massively government funded Exascale Computing Project failed to gain traction outside of a narrow community, failed to provide sustainable maintenance in the face of the funded end of the ECP, and would thus fare similarly poorly on a spider web analysis of their sustainability, their recommendability. [16] \"Workflows Community Summit 2024: Future Trends and Challenges in Scientific Workflows\", da Silva et al, https://zenodo.org/records/13844759. 
I participated in the event, as well as the prior in 2022, and you can compare to that report as well: \"Workflows Community Summit 2022: A Roadmap Revolution\", also da Silva et al, https://zenodo.org/records/7750670.[17] SC24, https://sc24.conference-program.com/[18] TOP 500 supercomputers, June 2024, https://top500.org/lists/top500/list/2024/06/ - to be updated again before Thanksgiving. ",
            "content_html": "<p class=\"ember-view reader-text-block__paragraph\" id=\"ember2131\"><span color=\"rgba(255, 255, 255, 0.9)\" style=\"font-family: verdana;\">Hello and happy fall holidays to you and yours.</span><span class=\"white-space-pre\" color=\"rgba(255, 255, 255, 0.9)\"> </span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember2131\"><span style=\"font-family: verdana;\">As I wrote about in the last blog post [<a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo\" href=\"https://www.linkedin.com/pulse/surfing-singularity-universe-computes-andy-gallo-6fgle\" target=\"_self\">1</a>], as quantum computing hardware matures over the next 5 to 10 years from an experimental toy through to utility and then perhaps advantage over classical (for some applications), it will be included into an already diverse and hybrid computing and applications landscape - on-prem computing, mobile and edge, cloud, and now novel types of computing devices which require new thinking and wholly new means of addressing them. How to deal with the burgeoning heterogeneity of the computing landscape - how to write and run apps which produce and consume data across a widening array of devices- is the topic of this post.<span class=\"white-space-pre\"> </span></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember2131\"><span style=\"font-family: verdana;\"><span class=\"white-space-pre\"><br /></span></span></p><h2><span style=\"font-family: verdana; font-size: x-large;\">Language Landscape<span class=\"white-space-pre\"> </span></span></h2><p class=\"ember-view reader-text-block__paragraph\" id=\"ember2133\"><span style=\"font-family: verdana;\">The Java programming language, once touted in the glory days of \"the World Wide Web\" as being \"write once, deploy anywhere\", and in its heyday representing 25% of new application development, is now down below 10%. What's hot? 
Python (23%), and \"the C's\", a collection of C, C++, C# and their kin (&gt;24% in total) which are traditionally recompiled for specific hardware platforms. [2] And while Python provides portability, often for performance in math operations it depends on native libraries, built in, you guessed it, the C's. Into this mix wades the US government which has come out recently with a potentially disruptive statement against the use of the C's, citing security concerns due to their free-wheeling memory management, and in spite of efforts like Safe C++, the government is recommending movement to memory safe languages like Rust, currently with just 1% market share, but \"with a bullet\". [3] Whether it is better to port to Rust or just update to Safe C++ depends on many factors - for example, how good are your docs and test cases - and while there may exist conceptual impedance mismatches between languages, modern AI coding assistants will only increase in capability especially for more rote tasks like porting.</span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember2134\"><span style=\"font-family: verdana;\">Add to this mix the coding of Graphics Processing Units (GPUs) - originally intended for visualizations but now used in applications for almost anything involving matrix math (turns out, lots of stuff). GPUs today are mostly sold by NVIDIA and are programmed in the C's (sometimes with a Python interface) using the NVIDIA CUDA library. These pieces of the application, the \"kernels\", are hardware dependent, and while many attempts have been made to create hardware-portable frameworks for GPU programming (see SYCL for example [4]), nearly always the newest fastest GPU features are available in the native non-portable form first, leading to vendor lock-in. 
(This might be a good time to remember that<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo\" href=\"https://www.linkedin.com/company/nvidiausa/\">NVIDIA</a><span class=\"white-space-pre\"> </span>does not themselves manufacture chips - they design chips which others produce.)</span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember2135\"><span style=\"font-family: verdana;\">The manner in which we program GPUs is similar to the way we program quantum computers, i.e. QPUs - we delegate to them the portions of the application to which they are best suited, program them using device-specific instructions, and weave them back into the holistic solution. Rather than wielding the Java hammer where everything is a virtualized nail, we use the best tool for the job. In quantum computing, for example, \"variational\" hybrid algorithms are a common theme, where some part of the work and preparation are performed on classical hardware as a setup for a quantum step, and then post-processing the results back on classical hardware for potential iteration to an optimal solution.<span class=\"white-space-pre\"> </span></span></p><div class=\"reader-image-block reader-image-block--full-width\"><figure class=\"reader-image-block__figure\"><div class=\"ivm-image-view-model\"><div class=\"ivm-view-attr__img-wrapper\"><img alt=\"\" class=\"ivm-view-attr__img--centered reader-image-block__img evi-image lazy-image ember-view\" id=\"ember2136\" src=\"https://media.licdn.com/dms/image/v2/D4E12AQHHLPGy-FVQ7g/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1731379488723?e=1740009600&amp;v=beta&amp;t=rBfADA-gBrGoIdiIzfxE-P90i-WNLV2EFP7uYQzam20\" /></div></div><figcaption class=\"reader-image-block__figure-image-caption display-block full-width text-body-small-open t-sans text-align-center t-black--light\"><span style=\"font-family: verdana;\">Two of several emerging patterns for integrating quantum computing 
into an application solution. [5]</span></figcaption></figure></div><p class=\"ember-view reader-text-block__paragraph\" id=\"ember2137\"><span style=\"font-family: verdana;\">This pattern is analogous to what is also common in classical high performance computing (HPC) for applications like weather modeling and other complex simulations - pre-process on commodity hardware, run an HPC job on the big box, and post-process the results. The introduction into the mix of steerage provided by AI models increases the heterogeneity of the complete solution.<span class=\"white-space-pre\"> </span></span></p><div class=\"reader-image-block reader-image-block--full-width\"><figure class=\"reader-image-block__figure\"><div class=\"ivm-image-view-model\"><div class=\"ivm-view-attr__img-wrapper\"><img alt=\"\" class=\"ivm-view-attr__img--centered reader-image-block__img evi-image lazy-image ember-view\" id=\"ember2138\" src=\"https://media.licdn.com/dms/image/v2/D4E12AQGVu8gKIzDRLg/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1731379555676?e=1740009600&amp;v=beta&amp;t=Bt4u6m8pi0i6xVNmfWxkxGK4bXBGkWuP3_zxrab1R3M\" /></div></div><figcaption class=\"reader-image-block__figure-image-caption display-block full-width text-body-small-open t-sans text-align-center t-black--light\"><span style=\"font-family: verdana;\">A blended computing landscape, enabling for example, quantum computing to produce highly precise data to train AI to steer a classical HPC simulation. [6]</span></figcaption></figure></div><p class=\"ember-view reader-text-block__paragraph\" id=\"ember2139\"><span style=\"font-family: verdana;\">All these hardware-dependent application pieces for an ever widening array of hardware means that compilers are cool again, and compiler pipelines like LLVM are critical to application development and deployment. 
[7] Included in this class of development tools are circuit transpilers for quantum hardware which must take into consideration not only the architectural differences between QPUs (e.g. which gates are supported, what's the inter-qubit connectivity like, etc.), but also the changes which can occur in a quantum data center on a daily basis as these new, noisy, and fragile qubits simply fail and go offline, potentially altering the machine's topology. Just-in-time compilation is needed, and compiler optimization is therefore also cool again. Thank you, Frances Allen. [8]<span class=\"white-space-pre\"> </span></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember2139\"><span style=\"font-family: verdana;\"><span class=\"white-space-pre\"><br /></span></span></p><h3 class=\"ember-view reader-text-block__heading-3\" id=\"ember2140\"><span style=\"font-family: verdana; font-size: x-large;\">Parts is Parts</span></h3><p class=\"ember-view reader-text-block__paragraph\" id=\"ember2141\"><span style=\"font-family: verdana;\">What emerges from this landscape is not a singular executable running on one computer, but rather, multiple application piece parts, written in different languages, running on radically different hardware in sequence and simultaneously, being orchestrated into a complete solution.</span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember2142\"><span style=\"font-family: verdana;\">In other words, a workflow. Back in the day Java's Sun Microsystems (remember them?) asserted \"the network is the computer\". Now we assert \"the workflow is the app\".<span class=\"white-space-pre\"> </span></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember2143\"><span style=\"font-family: verdana;\">Or more likely, a workflow of workflows. 
We like to think of these nested workflows in three types: [9]</span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember2144\"></p><ol><li><span style=\"font-family: verdana;\"><span>in-situ</span>: the workflow is running all on the same machine (e.g. a local process, an HPC job)</span></li><li><span style=\"font-family: verdana;\"><span>intra-site</span>: the workflow is running on different machines within the same connected enterprise (e.g. within the same data center, virtual network, etc.)</span></li><li><span style=\"font-family: verdana;\"><span>inter-site</span>: the workflow is running across different machines in different enterprises (e.g. hybrid on-prem and perhaps multi-vendor cloud)</span></li></ol><p></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember2145\"><span style=\"font-family: verdana;\">With all these compute types, languages, and locations working together to realize the workflow and solution, loose coupling is key - components connected but not dependent - each part minding its own business. In other words, to paraphrase the poet, good interfaces make good neighbors. [10]</span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember2146\"><span style=\"font-family: verdana;\">We use the convenience term \"Site\" to mean a provider of secure compute and data services. What interfaces must a Site provide? The interface or API can include lots of things, but it must at least provide: 1) authentication and authorization, 2) a means to run components through their lifecycle, 3) a means to manage data being operated on and produced, perhaps being moved into and out of the Site, and 4) some way to get an inventory of the Site's service offerings and provision them for the purposes of running components or holding data. 
We call these by four functional nicknames: Auth, Run, Repo, and Spin.<span class=\"white-space-pre\"> </span></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember2147\"><span style=\"font-family: verdana;\"><br /></span></p><div class=\"reader-image-block reader-image-block--resize\"><figure class=\"reader-image-block__figure\"><div class=\"ivm-image-view-model\"><div class=\"ivm-view-attr__img-wrapper\"><img alt=\"\" class=\"ivm-view-attr__img--centered reader-image-block__img evi-image lazy-image ember-view\" id=\"ember2148\" src=\"https://media.licdn.com/dms/image/v2/D4E12AQGVO-K51OaPPw/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1731379758535?e=1740009600&amp;v=beta&amp;t=t0p27qST2ZLG6ETfkhu-0OppATVBXdcfX7Mgnw1Qk2A\" /></div></div><figcaption class=\"reader-image-block__figure-image-caption display-block full-width text-body-small-open t-sans text-align-center t-black--light\"><span style=\"font-family: verdana;\">Four functional pillars of an interoperable computing site.</span></figcaption></figure></div><p class=\"ember-view reader-text-block__paragraph\" id=\"ember2149\"><span style=\"font-family: verdana;\">We can see in each of the three types of workflows the need for each of these four functional pillars, albeit some as a no-op or inherited from a higher order workflow. For example, in a \"type 1\" workflow of components running on a single machine or within an HPC allocation the Auth aspects may be implied to be already addressed - i.e. the user is already logged into the machine or authorized to run on the HPC cluster. But a workflow which utilizes compute resources both on-prem and in the cloud will have to interact at runtime with the \"auth\" aspects of the cloud provider prior to being able to \"run\" workloads, or put and get data to various \"repos\". Most cloud providers provide a means to list available computing resources, to \"spin\" them up and down. 
This provisioning itself can be part of an end-to-end workflow: authenticate, get an inventory of available services, spin some up, run jobs on them storing the results, and spin them down.</span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember2149\"><span style=\"font-family: verdana;\"><span class=\"white-space-pre\"> </span></span></p><h3 class=\"ember-view reader-text-block__heading-3\" id=\"ember2150\"><span style=\"font-family: verdana; font-size: x-large;\">Stuck in the Middle</span></h3><p class=\"ember-view reader-text-block__paragraph\" id=\"ember2151\"><span style=\"font-family: verdana;\">Most cloud providers - from Amazon to IBM Quantum cloud - provide a callable API interface which can be viewed through the lens of Auth, Run, Repo, Spin. So do some of the supercomputers and cutting edge resources provided by the Federal government, most notably those provided by the<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo\" href=\"https://www.linkedin.com/company/national-energy-research-scientific-computing-center/\">National Energy Research Scientific Computing Center (NERSC)</a>. [11]<span class=\"white-space-pre\"> </span></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember2152\"><span style=\"font-family: verdana;\">As Sites, these providers expose their offerings to internal and external workflows, however, they do not themselves promote a means to author these cross-site workflows, to manage them, track them, or keep tabs on all that distributed data. What else is needed? First, since cloud and other service providers have no motivation to standardize their interfaces, a framework super-interface could exist with the ability to plug in drivers for specific service providers. This in theory is the Auth, Run, Repo, Spin interface. 
Second, since each provider defines their own service and runtime component lifecycle (loosely: start, run, and stop with success or fail end states) there needs to be a way to normalize the status terminology - a \"fail\" on one site is the same as an \"error\" on another, \"success\" means the same thing as \"done\". This permits the third aspect of a middleware framework - the ability to track running jobs on Sites and trigger other jobs on any Site to run accordingly - i.e. the control flow of the workflow.<span class=\"white-space-pre\"> </span></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember2153\"><span style=\"font-family: verdana;\">What about the data? Commonly we need the ability to put data to a Site and get some back - this is the Repo interface of the Site. And while most (but not all) Sites provide some means to store and retrieve data, be it filesystem or S3 object store or database or something else, it would also be nice to be able to say something \"meta\" about the data - which Site did it come from, what job or application produced it, what other workflow steps on this Site or others consumed it? Some Sites provide storage with metadata (e.g. Amazon S3) but most don't. This metadata comprises the provenance of the data - like a Civil War sword on the Antiques Roadshow, it's the paper trail showing where the item came from, proving the item is legit. In a workflow which perhaps produces many pieces of data, perhaps iteratively as it converges on a solution - keeping track of all the data pieces seems, well, important. The acronym FAIR - findable, accessible, interoperable, reusable - seems a good starting point. 
[12]</span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember2153\"><span style=\"font-family: verdana;\"><br /></span></p><h3 class=\"ember-view reader-text-block__heading-3\" id=\"ember2154\"><span style=\"font-family: verdana; font-size: x-large;\">Open Says Me</span></h3><p class=\"ember-view reader-text-block__paragraph\" id=\"ember2155\"><span style=\"font-family: verdana;\">Our open source project lwfm, the \"local workflow manager\", attempts to render these concepts as a reference implementation. [13] It's small with minimal Python lib dependencies and can be taken anywhere easily as a single runnable component, its provenance metadata also easily portable and importable. A typical Site driver - a Python class which implements the Site interface - weighs in around 200 lines of code including the whitespace. Armed with a Site driver for a cloud service, you can author long-running workflows which utilize a mix of compute resources, storage, and data infrastructures, and automatically track the provenance paper trail.<span class=\"white-space-pre\"> </span></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember2156\"><span style=\"font-family: verdana;\">The lwfm middleware component provides some very recognizable services:</span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember2157\"></p><ul><li><span style=\"font-family: verdana;\">polling of remote job status<span class=\"white-space-pre\"> </span></span></li><li><span style=\"font-family: verdana;\">status normalization and persistence</span></li><li><span style=\"font-family: verdana;\">system and user metadata persistence</span></li><li><span style=\"font-family: verdana;\">event handling, control flow and data flow triggered</span></li></ul><p></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember2158\"><span style=\"font-family: verdana;\">Should you use this tooling? I wouldn't recommend it. (Huh? Did I hear you correctly?) 
How many people are working maintaining it? (Two?) What about the community? (Next to none.) The software would fare poorly on a \"spider web\" analysis of its overall quality - you would not want to recommend it to your boss.</span></p><div class=\"reader-image-block reader-image-block--resize\"><figure class=\"reader-image-block__figure\"><div class=\"ivm-image-view-model\"><div class=\"ivm-view-attr__img-wrapper\"><img alt=\"\" class=\"ivm-view-attr__img--centered reader-image-block__img evi-image lazy-image ember-view\" id=\"ember2159\" src=\"https://media.licdn.com/dms/image/v2/D4E12AQEs4kk5AG5gIA/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1731379896216?e=1740009600&amp;v=beta&amp;t=AMlOWYzQl60i9SQsBL77teawfvRVlFW_7l_yHoT4Nnk\" /></div></div><figcaption class=\"reader-image-block__figure-image-caption display-block full-width text-body-small-open t-sans text-align-center t-black--light\"><span style=\"font-family: verdana;\">A convenient multi-axis assessment framework for software model maturity. [14]</span></figcaption></figure></div><p class=\"ember-view reader-text-block__paragraph\" id=\"ember2160\"><span style=\"font-family: verdana;\">The lwfm is a reference implementation of a workflow interop framework, at best. Are there alternatives? OMG are there alternatives! The workflow landscape is notoriously rich, fragmented, and super-niched. But portability and interoperability are often neglected as is data provenance. 
Government or university projects, while well meaning and sometimes directionally correct, quickly go stale when the funding lapses [15], and commercial solutions, while often suffering some of the same deficiencies, offer the added trap of vendor lock-in and can come with a hefty price tag.</span></p><h3 class=\"ember-view reader-text-block__heading-3\" id=\"ember2161\"><span style=\"font-family: verdana;\">Order, Order</span></h3><p class=\"ember-view reader-text-block__paragraph\" id=\"ember2162\"><span style=\"font-family: verdana;\">So it's back to committee. [16] Next week the high performance computing community will be meeting again at the<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo\" href=\"https://www.linkedin.com/company/sc-conference/\">SC Conference Series</a><span class=\"white-space-pre\"> </span>Supercomputing 2024, this year in Atlanta. Hybrid workflows for scientific and engineering applications - involving classical HPC, AI-focused clusters, and now also quantum computers - will be among the very many topics discussed. [17] And we should expect some surprises - in the new rankings, for example, of the top machines on the planet - at least, the ones they want us to know about. [18]</span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember2163\"><span style=\"font-family: verdana;\">Perhaps I'll report back on some of those returns in a future blog. Best regards. 
- andy<span class=\"white-space-pre\"> </span></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember2164\"><span style=\"font-family: verdana;\"><br /></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember2165\"><span><span style=\"font-family: verdana; font-size: x-large;\">References &amp; Amusements<span class=\"white-space-pre\"> </span></span></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember2166\"><span style=\"font-family: verdana;\">[0] Banner photo by Ben Wicks on Unsplash</span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember2167\"><span style=\"font-family: verdana;\">[1] \"Surfing the Singularity: The Universe Computes\", A. Gallo,<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo\" href=\"https://www.linkedin.com/pulse/surfing-singularity-universe-computes-andy-gallo-6fgle\" target=\"_self\">https://www.linkedin.com/pulse/surfing-singularity-universe-computes-andy-gallo-6fgle</a></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember2168\"><span style=\"font-family: verdana;\">[2] TIOBE ranking of programming language popularity:<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo\" href=\"https://www.tiobe.com/tiobe-index/\" target=\"_self\">https://www.tiobe.com/tiobe-index/</a></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember2169\"><span style=\"font-family: verdana;\">[3] Safe C++, with some chronology of the government statements:<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo\" href=\"https://safecpp.org/\" target=\"_self\">https://safecpp.org/</a></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember2170\"><span style=\"font-family: verdana;\">[4] SYCL:<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo\" href=\"https://www.khronos.org/sycl/\" 
target=\"_self\">https://www.khronos.org/sycl/</a></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember2171\"><span style=\"font-family: verdana;\">[5] \"Post-variational quantum neural networks\",<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo\" href=\"https://pennylane.ai/qml/demos/tutorial_post-variational_quantum_neural_networks\" target=\"_self\">https://pennylane.ai/qml/demos/tutorial_post-variational_quantum_neural_networks</a></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember2172\"><span style=\"font-family: verdana;\">[6] \"Hope Versus Hype: Quantum, AI and the Path to Commercial Advantage\", Matthias Troyer, presentation at IEEE Quantum Week, Montreal, September 2024.</span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember2173\"><span style=\"font-family: verdana;\">[7] LLVM:<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo\" href=\"https://llvm.org/\" target=\"_self\">https://llvm.org/</a></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember2174\"><span style=\"font-family: verdana;\">[8]<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo\" href=\"https://amturing.acm.org/award_winners/allen_1012327.cfm\" target=\"_self\">https://amturing.acm.org/award_winners/allen_1012327.cfm</a></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember2175\"><span style=\"font-family: verdana;\">[9] \"Industrial Experience Deploying Heterogeneous Platforms for Use in Multi-Modal Power Systems Design Workflows\", A. 
Gallo et al,<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo\" href=\"https://drive.google.com/file/d/1c3YEVmEAUjbI5urj4PiV2TtjzBUzLlws\" target=\"_self\">https://drive.google.com/file/d/1c3YEVmEAUjbI5urj4PiV2TtjzBUzLlws</a></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember2176\"><span style=\"font-family: verdana;\">[10] \"Mending Wall\", Robert Frost,<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo\" href=\"https://www.poetryfoundation.org/poems/44266/mending-wall\" target=\"_self\">https://www.poetryfoundation.org/poems/44266/mending-wall</a></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember2177\"><span style=\"font-family: verdana;\">[11] NERSC SuperFacility API:<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo\" href=\"https://docs.nersc.gov/services/sfapi/\" target=\"_self\">https://docs.nersc.gov/services/sfapi/</a></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember2178\"><span style=\"font-family: verdana;\">[12] \"The FAIR Guiding Principles for scientific data management and stewardship\", Mark D. 
Wilkinson et al.,<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo\" href=\"https://pmc.ncbi.nlm.nih.gov/articles/PMC4792175/pdf/sdata201618.pdf\" target=\"_self\">https://pmc.ncbi.nlm.nih.gov/articles/PMC4792175/pdf/sdata201618.pdf</a></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember2180\"><span style=\"font-family: verdana;\">[13] lwfm,<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo\" href=\"https://github.com/lwfm-proj/lwfm\" target=\"_self\">https://github.com/lwfm-proj/lwfm</a><span class=\"white-space-pre\"> </span></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember2181\"><span style=\"font-family: verdana;\">[14] \"Model Maturity Web\",<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo\" href=\"https://richardarthur.medium.com/co-design-web-6f37664ac1e1\" target=\"_self\">https://richardarthur.medium.com/co-design-web-6f37664ac1e1</a></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember2182\"><span style=\"font-family: verdana;\">[15] Them's fighting words, and I expect to be roasted for it. 
But it seems to me that even the most popular software tool kits (no names) which emerged from the massively government funded Exascale Computing Project failed to gain traction outside of a narrow community, failed to provide sustainable maintenance in the face of the end of ECP funding, and would thus fare similarly poorly on a spider web analysis of their sustainability, their recommendability.<span class=\"white-space-pre\"> </span></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember2183\"><span style=\"font-family: verdana;\">[16] \"Workflows Community Summit 2024: Future Trends and Challenges in Scientific Workflows\", da Silva et al, <a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo\" href=\"https://zenodo.org/records/13844759\" target=\"_self\">https://zenodo.org/records/13844759</a>. I participated in the event, as well as the prior one in 2022, and you can compare to that report as well: \"Workflows Community Summit 2022: A Roadmap Revolution\", also da Silva et al,<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo\" href=\"https://zenodo.org/records/7750670\" target=\"_self\">https://zenodo.org/records/7750670</a>.</span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember2184\"><span style=\"font-family: verdana;\">[17] SC24,<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo\" href=\"https://sc24.conference-program.com/\" target=\"_self\">https://sc24.conference-program.com/</a></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember2185\"><span style=\"font-family: verdana;\">[18] TOP 500 supercomputers, June 2024,<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo\" href=\"https://top500.org/lists/top500/list/2024/06/\" target=\"_self\">https://top500.org/lists/top500/list/2024/06/</a><span class=\"white-space-pre\"> </span>- to be updated again before Thanksgiving.<span class=\"white-space-pre\"> 
</span></span></p>",
            "url": "https://hpc.social/personal-blog/2024/surfing-the-singularity-the-workflow-is-the-app/",
            
            
            
            
            
            "date_published": "2024-11-12T17:00:00-07:00",
            "date_modified": "2024-11-12T17:00:00-07:00",
            
                "author": "Surfing the Singularity"
            
        },
    
        {
            "id": "https://hpc.social/personal-blog/2024/surfing-the-singularity-the-universe-computes/",
            "title": "Surfing the Singularity - The Universe Computes",
            "summary": null,
            "content_text": "Just back from the IEEE Computer Society Quantum Week in Montreal, and besides eating my weight in pastry and bagels [1], it was a great conference. The collective hardware roadmaps from the major players leave us thinking the big wave in quantum computing is not here yet, but soon - perhaps in 5 years' time for scientific applications, and within a decade for commercial utility. While the current software continues to be sparse and low-level, there are inklings of software engineers starting to build up a stack in anticipation of needing one. But besides that, and at the risk of sounding like the other hot topic - AI marketeer hype [2] - there is the sense with quantum of being present at a new phase in computing at least, if not something larger still. The concept of the computing universe is still just a hypothesis; nothing has been proved. However, I am confident that this idea can help unveil the secrets of nature. - Konrad Zuse, 1969 [3] It seems, at its core, that the universe computes. Conch shells grow in logarithmic spirals, bees and orb weavers understand structural geometry. Many animals - not including Mr. Ed or Clever Hans [4], but including primates, fish, and rats - have been shown to use simple arithmetic or approximations, and in other cases can show an ability to order objects in a list. There are lizards and sea shells with surface patterns constructed by cellular automata processes - simple rules which can produce complex structures, like Conway's \"Game of Life\". Ducks, using an inherently quantum mechanical biological process, can see magnetic fields giving them a kind of \"heads up\" display when migrating. [5] A variation on Conway's \"Game of Life\" [6] Our own eyes are themselves literally photon detectors, our retinas and optic nerves pre-processing the signals before they even get to the brain. 
DNA stores vast amounts of information in simple patterns of four \"letters\" (effectively two classical bits), recipes to manufacture from the raw material of the universe a wide array of proteins for all manner of biological purposes, including of course growing your own brain. Humans can themselves perform the manual calculations necessary to build bridges and other structures which can span and withstand the forces of nature for centuries. It's also not hard to see computation of a kind in plants and their systemic networks of roots. Is the universe computing, or is the universe doing the only thing it can based on the rules? Water flowing downhill. Lightning finding its own path of least resistance. Entangled electrons separated at distance flipping their spin in response to their partner, in real time. [7] An Entangled History Nature isn't classical, dammit, and if you want to make a simulation of nature, you'd better make it quantum mechanical. - Richard Feynman [8] I think I can safely say that nobody understands quantum mechanics. - Richard Feynman [9] In the 1920s physicists like Heisenberg, Born, Pauli, and Schrödinger convinced their peers of the validity of a new formulation of the laws of physics they called (in German) \"quantum mechanics\", describing the behavior of nature and the universe even below the scale of atoms. This then led computing pioneer John Von Neumann in the 1930s to solidify some of the necessary maths to perform discrete quantum mechanical calculations. Von Neumann, a member of the Manhattan Project, would go on to formulate the hardware architecture for today's \"classical\" computers, the approach to computing being challenged by quantum computing today. Since then, most notably in the 1940s with the Manhattan Project, humans have shown an increasing ability to harness the basic quantum physics into a new range of applications. 
We learned how to make superconductors, and how to park and manipulate individual atoms like tinkertoys [10], and now we've learned how to use these skills to make computers, with most immediate applications in modeling quantum systems like atoms and molecules, as Nobel laureate and Manhattan Project member Feynman predicted more than 40 years ago. We learned how to make lasers and LEDs, and we can now also harness photons for computing. The scientists have had nearly a century to refine their theories, and are now handing off to the engineers to prove the depth of human understanding by building things in the real world, with the business people eagerly waiting on the sidelines in anticipation (think: MRI machines). It seems that the universe computes - we now endeavor to make use of that knowledge for our own human purposes. Qubits, Gates, and Error Everywhere What is a qubit? Like the early video game Q*bert which showed a simulated 3D world on a 2D screen back in 1982, a qubit is a little hard to visualize as it goes deeper than a classical binary bit - much deeper. While a bit can be either a zero or a one but nothing else, a qubit can model the probability of a zero or a one and probabilistically anything in between. It might be a zero, or it might be a one, with some probability of each. It might start off a zero, and then noise from the environment might cause it to drift, or it might move off the zero purposefully as a result of acting on the qubit with one of several kinds of single and multi-qubit gates. Like gates in classical computing, a quantum gate can flip the qubit, or, unique to quantum, just nudge it a little. As in classical computing, it's the acting on the qubit by gates which results in the computing. A native quantum program is a circuit, a directed graph, composed of gates. Visualizing the effect of various quantum gates on a single qubit. [11] How fast can we flip a qubit? 
Heisenberg 100 years ago gave us a way to compute the lower bound on the time to flip a spin given an energy - in short, it's fast. But this doesn't even tell the whole performance story because of superposition and the ability of wide circuits to act on multiple qubits simultaneously - the speed advantage of quantum over classical can be exponential, albeit application-specific, making a class of problems which would be classically uncomputable in any human lifetime now well within reach. The main challenge in realizing these potentials is the noise. Qubits are noisy, meaning they don't stay fixed where you think you last left them, and the seemingly magical entanglement also can show decoherence over time and distance. The gates necessary for computation can themselves introduce noisy error, as can the act of measuring the qubit to sample the solution result. Software algorithms can also just be estimates, and thus introduce their own error term relative to experiment. The prevalence of error in quantum computing means the algorithms themselves must be aware the results might be unpredictable, might need to be computed more than once to improve confidence, and might need to allocate and use a good number of precious qubits just to help mitigate the errors. We call the period we are in today the \"NISQ era\", meaning noisy intermediate scale quantum computing. Beyond the noise, how do we quantify \"scale\" or other metrics for sizing up the capabilities of quantum computers, now, in the future, and as compared to classical? One aspect of the problem is that when comparing quantum to classical we're not comparing apples to apples, and even within \"quantum apples\" as we have seen, there are different kinds. [12] In one simple measure we can count qubits, but we must also know their error rate, and we must notice something about their connectedness - some quantum hardware uses an all-to-all grid, and others use other topologies - racetracks and the like. 
And qubits can fail. Because of the limitations of qubits in the NISQ era the connectedness matters: it must be known at the time of circuit transpilation, and may result in swaps or other strategies employed by the transpiler toolchain to minimize errors due to the physical layout. Qubit coherence in superposition can be measured and reported as a hardware spec. Quantum volume is a number which expresses the size of a circuit N qubits wide by d gates deep which can be executed on a given machine. Gate errors, especially for 2-qubit gates, can be reported by the vendor. CLOPS - circuit layer operations per second - is another proposed metric which takes into consideration the time to prepare the qubits, execute the gates, and take the measurement of the result. [13] The US government in the form of the Defense Advanced Research Projects Agency (DARPA) has gotten into the game of studying this varied performance landscape, towards being able to help pick winners and losers and accelerate innovation with funding awards. [14] Quantum Hardware Roadmap I build quantum computers that store information on individual atoms and then massage the normal interactions between atoms to make them compute. - Seth Lloyd [15] This year's IEEE Quantum Week was an opportunity to see and hear from most of the major players in quantum computing R&D - those focused on quantum processors, systems control, networking, and software. The software topic we'll leave for a future blog, but focusing on hardware, the vendors collectively represented multiple distinct technical approaches to making a quantum computing machine. There are superconducting qubits from US companies like IBM Quantum, Google, and Rigetti Computing, which refrigerate and maintain the qubit between a ground and an excited state. Trapped ion computers from companies like IonQ and Quantinuum, and neutral atom computers from QuEra Computing Inc. 
use novel methods to again cool the system qubits to near absolute zero. But there are also quantum computers which operate quite differently. Quantum annealing, or algorithmically simulating an adiabatic process for slowly evolving a system to an optimal state, could be simulated on one of the above general quantum machines, or shown more directly on a specialized quantum machine from a Canadian company like D-Wave. Xanadu, also based in Canada, performs its quantum tricks with photonics. And while Google may jump the gun on announcing successes from time to time, they and Microsoft and others are working on a \"topological qubit\" based on previously only-hypothesized Majorana particles which provide the great advantage relative to other qubit implementations of being able to be controlled digitally. [16, 17] Staying within the NISQ era as the machines scale up, a good chunk of the available qubits will continue to be allocated to error correction schemes, a task which may later, as these systems mature, be allocated to a software layer. At 100s of useful error-corrected qubits we can start to gain real scientific utility from quantum computing - begin to do research with quantum rather than research about quantum. Vendors such as Quantinuum promise a fully connected machine of that size in 5 years. In 10 years, vendors expect to deliver machines with 1000s of error-corrected qubits, which will usher in commercial utility, and the era of \"cryptographically relevant\" quantum computing (i.e. DARPA wants the US to get there first [18]). In the meantime, certain scientific domains, those which study things most closely associated with real quantum systems, will be early adopters of the technology. Molecular biology. Chemistry, for example, studying better ways to perform synthetic nitrogen fixation (think: energy-costly ammonia production for fertilizers). Conclusion The quantum hardware industry is in its infancy. 
From this gaggle of eager go-getters it’s reasonable to assume there will be technical and business winners and losers. For reasons of national security, governments will ramp up their involvement. But current machines are small, flaky, and limited in usefulness. It will be 5 to 10 years before there are quantum computers being used more commonly. A new but also a familiar approach to software will be needed - more on that in a future blog. Until utility arrives, some industries will be early leaders, ready to capitalize on an exponential increase in computing capability, one which promises to get us closer to harnessing the grand computing engine of the universe which is all around and within. References & Trivia [0] Photo by Ben Wicks on Unsplash [1] Montreal bagels: https://www.mtl.org/en/experience/the-famous-montreal-bagel [2] Gartner AI Hype Cycle 2024 explained: https://www.youtube.com/watch?v=qXKYOR3KqxQ [3] \"Calculating Space\", Konrad Zuse, 1969, https://philpapers.org/archive/ZUSRR.pdf It's worth noting that while having worked for Ford Motor Co. 
in his early career, and like Von Neumann doing very important early work on computers, Zuse was a conscripted employee of the German Nazi government from 1939-1945. [4] Clever Hans: https://www.horsejournals.com/popular/history-heritage/clever-hans [5] \"How Migrating Birds Use Quantum Effects to Navigate\", Scientific American, April 2022, https://www.scientificamerican.com/article/how-migrating-birds-use-quantum-effects-to-navigate/ [6] A variation on Conway's Game of Life, https://stackoverflow.com/questions/70019538/simple-animation-for-conways-game-of-life-with-funcanimation [7] \"Real-Time Imaging of Quantum Entanglement\", 2013, https://youtu.be/wGkx1MUw2TU?si=mnIExRs2ZOwv46Bh, but not strangely enough \"Entanglement between superconducting qubits and a tardigrade\", https://arxiv.org/pdf/2112.07978 [8] \"Simulating Physics with Computers\", International Journal of Theoretical Physics vol 21, transcript of a talk at MIT by Richard Feynman, 1981, https://s2.smu.edu/~mitch/class/5395/papers/feynman-quantum-1981.pdf [9] \"The Character of Physical Law\", transcript of lectures by Richard Feynman at Cornell U, 1967, https://archive.org/details/characterofphysi0000feyn/page/12/mode/2up [10] \"2 Researchers Spell 'I.B.M.' 
Atom by Atom\", New York Times, April 5, 1990, https://timesmachine.nytimes.com/timesmachine/1990/04/05/356490.html?pageNumber=41 [11] \"Qubit Bloch Sphere Visualization\", Casey Duckering, https://raw.githubusercontent.com/cduck/bloch_sphere/master/examples/xyss_gate.gif [12] It's apple picking season here in New York: https://www.applesfromny.com/varieties/ [13] \"Driving quantum performance: more qubits, higher Quantum Volume, and now a proper measure of speed\", https://www.ibm.com/quantum/blog/circuit-layer-operations-per-second [14] DARPA Quantum Benchmarking Initiative, https://www.darpa.mil/work-with-us/quantum-benchmarking-initiative [15] \"The Computational Universe\", Seth Lloyd, 2002, https://www.edge.org/conversation/seth_lloyd-the-computational-universe [16] \"Google Claims To Achieve Quantum Supremacy — IBM Pushes Back\", https://www.npr.org/2019/10/23/772710977/google-claims-to-achieve-quantum-supremacy-ibm-pushes-back [17] \"A route to scalable Majorana qubits\", https://phys.org/news/2024-06-route-scalable-majorana-qubits.html [18] \"DARPA's quantum computing is powered by ... FOMO\", https://www.theregister.com/2023/02/02/darpa_quantum_microsoft/",
            "content_html": "<p><span style=\"font-family: verdana;\"><span>Just back from the</span><span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo \" href=\"https://www.linkedin.com/company/ieee-computer-society/\" target=\"_self\">IEEE Computer Society</a><span class=\"white-space-pre\"> </span><span>Quantum Week in Montreal, and besides eating my weight in pastry and bagels [1], it was a great conference. The collective hardware roadmaps from the major players leave us thinking the big wave in quantum computing is not here yet, but soon - perhaps in 5 years' time for scientific applications, and within a decade for commercial utility. While the current software continues to be sparse and low-level, there are inklings of software engineers starting to build up a stack in anticipation of needing one. But besides that, and at the risk of sounding like the other hot topic - AI marketeer hype [2] - there is the sense with quantum of being present at a new phase in computing at least, if not something larger still.</span><span class=\"white-space-pre\"> </span></span></p><blockquote class=\"ember-view reader-text-block__blockquote\" id=\"ember3175\"><span style=\"color: #04ff00; font-family: verdana;\">The concept of the computing universe is still just a hypothesis; nothing has been proved. However, I am confident that this idea can help unveil the secrets of nature. - Konrad Zuse, 1969 [3]</span></blockquote><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3176\"><span style=\"font-family: verdana;\">It seems, at its core, that the universe computes. Conch shells grow in logarithmic spirals, bees and orb weavers understand structural geometry. Many animals - not including Mr. Ed or Clever Hans [4], but including primates, fish, and rats - have been shown to use simple arithmetic or approximations, and in other cases can show an ability to order objects in a list. 
There are lizards and sea shells with surface patterns constructed by cellular automata processes - simple rules which can produce complex structures, like Conway's \"Game of Life\". Ducks, using an inherently quantum mechanical biological process, can see magnetic fields giving them a kind of \"heads up\" display when migrating. [5]<span class=\"white-space-pre\"> </span></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3177\"><span style=\"font-family: verdana;\"><br /></span></p><div class=\"reader-image-block reader-image-block--full-width\"><figure class=\"reader-image-block__figure\"><div class=\"ivm-image-view-model   \"><div class=\"ivm-view-attr__img-wrapper                \"><img alt=\"\" class=\"ivm-view-attr__img--centered  reader-image-block__img evi-image lazy-image ember-view\" id=\"ember3178\" src=\"https://media.licdn.com/dms/image/v2/D4E12AQF2LHaealtiYA/article-inline_image-shrink_1000_1488/article-inline_image-shrink_1000_1488/0/1727727559513?e=1740009600&amp;v=beta&amp;t=vLl0WFGaEuH2xJfhffuvIBSRwDXRnMtGU8m-kADT2c4\" /></div></div><figcaption class=\"reader-image-block__figure-image-caption display-block full-width text-body-small-open t-sans text-align-center t-black--light\"><span style=\"font-family: verdana;\">A variation on Conway's \"Game of Life\" [6]</span></figcaption></figure></div><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3179\"><span style=\"font-family: verdana;\">Our own eyes are themselves literally photon detectors, our retinas and optic nerves pre-processing the signals before they even get to the brain. DNA stores vast amounts of information in simple patterns of four \"letters\" (effectively two classical bits), recipes to manufacture from the raw material of the universe a wide array of proteins for all manner of biological purposes, including of course growing your own brain. 
Humans can themselves perform the manual calculations necessary to build bridges and other structures which can span and withstand the forces of nature for centuries. It's also not hard to see computation of a kind in plants and their systemic networks of roots. Is the universe computing, or is the universe doing the only thing it can based on the rules? Water flowing downhill. Lightning finding its own path of least resistance. Entangled electrons separated at distance flipping their spin in response to their partner, in real time. [7]</span></p><div class=\"separator\" style=\"clear: both; text-align: center;\"></div><h3 class=\"ember-view reader-text-block__heading-3\" id=\"ember3181\"><span style=\"font-family: verdana; font-size: x-large;\">An Entangled History</span></h3><blockquote class=\"ember-view reader-text-block__blockquote\" id=\"ember3182\"><span style=\"color: #04ff00; font-family: verdana;\">Nature isn't classical, dammit, and if you want to make a simulation of nature, you'd better make it quantum mechanical. - Richard Feynman [8]<span class=\"white-space-pre\"> </span></span></blockquote><blockquote class=\"ember-view reader-text-block__blockquote\" id=\"ember3183\"><span style=\"color: #04ff00; font-family: verdana;\">I think I can safely say that nobody understands quantum mechanics. - Richard Feynman [9]</span></blockquote><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3184\"><span style=\"font-family: verdana;\">In the 1920s physicists like Heisenberg, Born, Pauli, and Schrödinger convinced their peers of the validity of a new formulation of the laws of physics they called (in German) \"quantum mechanics\", describing the behavior of nature and the universe even below the scale of atoms. 
This then led computing pioneer John Von Neumann in the 1930s to solidify some of the necessary maths to perform discrete quantum mechanical calculations. Von Neumann, a member of the Manhattan Project, would go on to formulate the hardware architecture for today's \"classical\" computers, the approach to computing being challenged by quantum computing today.<span class=\"white-space-pre\"> </span></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3185\"><span style=\"font-family: verdana;\">Since then, most notably in the 1940s with the Manhattan Project, humans have shown an increasing ability to harness basic quantum physics for a new range of applications. We learned how to make superconductors, and how to park and manipulate individual atoms like tinkertoys [10], and now we've learned how to use these skills to make computers, with the most immediate applications in modeling quantum systems like atoms and molecules, as Nobel laureate and Manhattan Project member Feynman predicted more than 40 years ago. We learned how to make lasers and LEDs, and we can now also harness photons for computing. The scientists have had nearly a century to refine their theories, and are now handing off to the engineers to prove the depth of human understanding by building things in the real world, with the business people eagerly waiting on the sidelines in anticipation (think: MRI machines). 
It seems that the universe computes - we now endeavor to make use of that knowledge for our own human purposes.<span class=\"white-space-pre\"> </span></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3186\"><span style=\"font-family: verdana;\"><br /></span></p><h3 class=\"ember-view reader-text-block__heading-3\" id=\"ember3187\"><span style=\"font-family: verdana; font-size: x-large;\">Qubits, Gates, and Error Everywhere</span></h3><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3188\"><span style=\"font-family: verdana;\">What is a qubit? Like the early video game Q*bert which showed a simulated 3D world on a 2D screen back in 1982, a qubit is a little hard to visualize as it goes deeper than a classical binary bit - much deeper. While a bit can be either a zero or a one but nothing else, a qubit can model the<span class=\"white-space-pre\"> </span><span>probability</span><span class=\"white-space-pre\"> </span>of a zero or a one and probabilistically anything in between. It might be a zero, or it might be a one, with some probability of each. It might start off a zero, and then noise from the environment might cause it to drift, or it might move off the zero purposefully as a result of acting on the qubit with one of several kinds of single and multi-qubit gates. Like gates in classical computing, a quantum gate can flip the qubit, or, unique to quantum, just nudge it a little. As in classical computing, it's the action of gates on the qubit which results in the computing. 
A native quantum program is a circuit, a directed graph, composed of gates.<span class=\"white-space-pre\"> </span></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3189\"><span style=\"font-family: verdana;\"><br /></span></p><div class=\"reader-image-block reader-image-block--full-width\"><figure class=\"reader-image-block__figure\"><div class=\"ivm-image-view-model   \"><div class=\"ivm-view-attr__img-wrapper                \"><img alt=\"\" class=\"ivm-view-attr__img--centered  reader-image-block__img evi-image lazy-image ember-view\" id=\"ember3190\" src=\"https://media.licdn.com/dms/image/v2/D4E12AQFAklbbJReWZA/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1727726980638?e=1740009600&amp;v=beta&amp;t=vnbWE-YUgsGxlQf5M2FnYJ7DLtJl-P7NvK_b0j3YVw0\" /></div></div><figcaption class=\"reader-image-block__figure-image-caption display-block full-width text-body-small-open t-sans text-align-center t-black--light\"><span style=\"font-family: verdana;\">Visualizing the effect of various quantum gates on a single qubit. [11]</span></figcaption></figure></div><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3191\"><span class=\"white-space-pre\"><span style=\"font-family: verdana;\"> </span></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3192\"><span style=\"font-family: verdana;\">How fast can we flip a qubit? Heisenberg 100 years ago gave us a way to compute the lower bound on the time to flip a spin given an energy - in short, it's fast. 
But this doesn't even tell the whole performance story because of superposition and the ability of wide circuits to act on multiple qubits simultaneously - the speed advantage of quantum over classical can be exponential, albeit application-specific, making a class of problems which would be classically uncomputable in any human lifetime now well within reach.<span class=\"white-space-pre\"> </span></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3193\"><span style=\"font-family: verdana;\">The main challenge in realizing these potentials is the noise. Qubits are noisy, meaning they don't stay fixed where you think you last left them, and the seemingly magical entanglement also can show decoherence over time and distance. The gates necessary for computation can themselves introduce noisy error, as can the act of measuring the qubit to sample the solution result. Software algorithms can also just be estimates, and thus introduce their own error term relative to experiment. The prevalence of error in quantum computing means the algorithms themselves must be aware the results might be unpredictable, might need to be computed more than once to improve confidence, and might need to allocate and use a good number of precious qubits just to help mitigate the errors. We call the period we are in today the \"NISQ era\", meaning noisy intermediate-scale quantum computing.<span class=\"white-space-pre\"> </span></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3194\"><span style=\"font-family: verdana;\">Beyond the noise, how do we quantify \"scale\" or other metrics for sizing up the capabilities of quantum computers, now, in future, and as compared to classical? One aspect of the problem is that when comparing quantum to classical we're not comparing apples to apples, and even within \"quantum apples\" as we have seen, there are different kinds. 
[12] In one simple measure we can count qubits, but we must also know their error rate, and we must notice something about their connectedness - some quantum hardware uses an all-to-all grid, and others use other topologies - racetracks and the like. And qubits can fail. Because of the limitations of qubits in the NISQ era the connectedness matters, must be known at the time of circuit transpilation, and may result in swaps or other strategies employed by the transpiler toolchain to minimize errors due to the physical layout. Qubit coherence in superposition can be measured and reported as a hardware spec. Quantum<span class=\"white-space-pre\"> </span><span>volume</span><span class=\"white-space-pre\"> </span>is a number which expresses the size of a circuit N qubits wide by d gates deep which can be executed on a given machine. Gate errors, especially for 2-qubit gates, can be reported by the vendor. CLOPS - circuit layer operations per second - is another proposed metric which takes into consideration the time to prepare the qubits, execute the gates, and take the measurement of the result. [13] The US government, in the form of the<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo \" href=\"https://www.linkedin.com/company/darpa/\">Defense Advanced Research Projects Agency (DARPA)</a>,<span class=\"white-space-pre\"> </span>has gotten into the game of studying this varied performance landscape, towards being able to help pick winners and losers and accelerate innovation with funding awards. 
[14]<span class=\"white-space-pre\"> </span></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3195\"><span style=\"font-family: verdana;\"><br /></span></p><h3 class=\"ember-view reader-text-block__heading-3\" id=\"ember3196\"><span style=\"font-family: verdana; font-size: x-large;\">Quantum Hardware Roadmap</span></h3><blockquote class=\"ember-view reader-text-block__blockquote\" id=\"ember3197\"><span style=\"color: #04ff00; font-family: verdana;\">I build quantum computers that store information on individual atoms and then massage the normal interactions between atoms to make them compute. - Seth Lloyd [15]</span></blockquote><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3198\"><span style=\"font-family: verdana;\">This year's IEEE Quantum Week was an opportunity to see and hear from most of the major players in quantum computing R&amp;D - those focused on quantum processors, systems control, networking, and software. The software topic we'll leave as a topic of a future blog, but focusing on hardware, the vendors collectively represented multiple distinct technical mechanisms to making a quantum computing machine. There's superconducting qubits from US companies like<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo \" href=\"https://www.linkedin.com/showcase/ibm-quantum/\">IBM Quantum</a><span class=\"white-space-pre\"> </span>,<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo \" href=\"https://www.linkedin.com/company/google/\">Google</a><span class=\"white-space-pre\"> </span>, and<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo \" href=\"https://www.linkedin.com/company/rigetti-computing/\">Rigetti Computing</a>, which refrigerate and maintain the qubit between a ground and an excited state. 
Trapped ion computers from companies like<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo \" href=\"https://www.linkedin.com/company/ionq-co/\">IonQ</a><span class=\"white-space-pre\"> </span>and<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo \" href=\"https://www.linkedin.com/company/quantinuumqc/\">Quantinuum</a><span class=\"white-space-pre\"> </span>, and neutral atom computers from<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo \" href=\"https://www.linkedin.com/company/quera-computing-inc/\">QuEra Computing Inc.</a><span class=\"white-space-pre\"> </span>use novel methods to again cool the system qubits to near absolute zero. But there are quantum computers which also operate quite differently. Quantum annealing, or algorithmically simulating an adiabatic process for slowly evolving a system to an optimal state, could be simulated on one of the above general quantum machines, or shown more directly on a specialized quantum machine from a Canadian company like<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo \" href=\"https://www.linkedin.com/company/d-wave-quantum/\">D-Wave</a>.<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo \" href=\"https://www.linkedin.com/company/xanaduai/\">Xanadu</a>, also based in Canada, performs its quantum tricks with photonics. And while Google may jump the gun on announcing successes from time to time, they and<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo \" href=\"https://www.linkedin.com/company/microsoft/\">Microsoft</a><span class=\"white-space-pre\"> </span>and others are working on a \"topological qubit\" based on previously only-hypothesized Majorana particles which provide the great advantage relative to other qubit implementations of being able to be controlled digitally. 
[16, 17]<span class=\"white-space-pre\"> </span></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3199\"><span style=\"font-family: verdana;\">Staying within the NISQ era as the machines scale up, a good chunk of the available qubits will continue to be allocated to error correction schemes, a task which may later, as these systems mature, be allocated to a software layer. At 100s of useful error-corrected qubits we can start to gain real scientific utility from quantum computing - begin to do research<span class=\"white-space-pre\"> </span><span>with</span><span class=\"white-space-pre\"> </span>quantum rather than research<span class=\"white-space-pre\"> </span><span>about</span><span class=\"white-space-pre\"> </span>quantum. Vendors such as Quantinuum promise a fully connected machine of that size in 5 years. In 10 years, vendors expect to deliver machines with 1000s of error-corrected qubits, which will usher in commercial utility, and the era of \"cryptographically relevant\" quantum computing (DARPA wants the US to get there first [18]).<span class=\"white-space-pre\"> </span></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3200\"><span style=\"font-family: verdana;\">In the meantime, certain scientific domains, those which study things most closely associated with real quantum systems, will be early adopters of the technology. Molecular biology. 
Chemistry, for example, studying better ways to perform synthetic nitrogen fixation (think: energy-costly ammonia production for fertilizers).<span class=\"white-space-pre\"> </span></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3201\"><span style=\"font-family: verdana;\"><br /></span></p><h3 class=\"ember-view reader-text-block__heading-3\" id=\"ember3202\"><span style=\"font-family: verdana; font-size: x-large;\">Conclusion<span class=\"white-space-pre\"> </span></span></h3><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3203\"><span style=\"font-family: verdana;\">The quantum hardware industry is in its infancy. From this gaggle of eager go-getters it’s reasonable to assume there will be technical and business winners and losers. For reasons of national security, governments will ramp up their involvement. But current machines are small, flaky, and limited in usefulness. It will be 5 to 10 years before there are quantum computers being used more commonly. A new but also a familiar approach to software will be needed - more on that in a future blog. 
Until that utility arrives, some industries will be early leaders, ready to capitalize on an exponential increase in computing capability, one which promises to get us closer to harnessing the grand computing engine of the universe which is all around and within.</span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3204\"><span style=\"font-family: verdana;\"><br /></span></p><h3 class=\"ember-view reader-text-block__heading-3\" id=\"ember3205\"><span style=\"font-family: verdana; font-size: x-large;\">References &amp; Trivia</span></h3><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3206\"><span style=\"font-family: verdana;\">[0] Photo by Ben Wicks on Unsplash</span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3207\"><span style=\"font-family: verdana;\">[1] Montreal bagels:<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo \" href=\"https://www.mtl.org/en/experience/the-famous-montreal-bagel\" target=\"_self\">https://www.mtl.org/en/experience/the-famous-montreal-bagel</a></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3208\"><span style=\"font-family: verdana;\">[2] Gartner AI Hype Cycle 2024 explained:<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo \" href=\"https://www.youtube.com/watch?v=qXKYOR3KqxQ\" target=\"_self\">https://www.youtube.com/watch?v=qXKYOR3KqxQ</a><span class=\"white-space-pre\"> </span></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3209\"><span style=\"font-family: verdana;\">[3] \"Calculating Space\", Konrad Zuse, 1969,<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo \" href=\"https://philpapers.org/archive/ZUSRR.pdf\" target=\"_self\">https://philpapers.org/archive/ZUSRR.pdf</a><span class=\"white-space-pre\"> </span>It's worth noting that while having worked for Ford Motor Co. 
in his early career, and like Von Neumann doing very important early work on computers, Zuse was a conscripted employee of the German Nazi government from 1939-1945.</span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3210\"><span style=\"font-family: verdana;\">[4] Clever Hans:<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo \" href=\"https://www.horsejournals.com/popular/history-heritage/clever-hans\" target=\"_self\">https://www.horsejournals.com/popular/history-heritage/clever-hans</a><span class=\"white-space-pre\"> </span></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3211\"><span style=\"font-family: verdana;\">[5] \"How Migrating Birds Use Quantum Effects to Navigate\", Scientific American, April 2022,<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo \" href=\"https://www.scientificamerican.com/article/how-migrating-birds-use-quantum-effects-to-navigate/\" target=\"_self\">https://www.scientificamerican.com/article/how-migrating-birds-use-quantum-effects-to-navigate/</a></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3212\"><span style=\"font-family: verdana;\">[6] A variation on Conway's Game of Life,<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo \" href=\"https://stackoverflow.com/questions/70019538/simple-animation-for-conways-game-of-life-with-funcanimation\" target=\"_self\">https://stackoverflow.com/questions/70019538/simple-animation-for-conways-game-of-life-with-funcanimation</a></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3213\"><span style=\"font-family: verdana;\">[7] \"Real-Time Imaging of Quantum Entanglement\", 2013,<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo \" href=\"https://youtu.be/wGkx1MUw2TU?si=mnIExRs2ZOwv46Bh\" target=\"_self\">https://youtu.be/wGkx1MUw2TU?si=mnIExRs2ZOwv46Bh</a>, but not 
strangely enough \"Entanglement between superconducting qubits and a tardigrade\",<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo \" href=\"https://arxiv.org/pdf/2112.07978\" target=\"_self\">https://arxiv.org/pdf/2112.07978</a><span class=\"white-space-pre\"> </span></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3214\"><span style=\"font-family: verdana;\">[8] \"Simulating Physics with Computers\", International Journal of Theoretical Physics vol 21, transcript of a talk at MIT by Richard Feynman, 1981,<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo \" href=\"https://s2.smu.edu/~mitch/class/5395/papers/feynman-quantum-1981.pdf\" target=\"_self\">https://s2.smu.edu/~mitch/class/5395/papers/feynman-quantum-1981.pdf</a></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3215\"><span style=\"font-family: verdana;\">[9] \"The Character of Physical Law\", transcript of lectures by Richard Feynman at Cornell U, 1967,<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo \" href=\"https://archive.org/details/characterofphysi0000feyn/page/12/mode/2up\" target=\"_self\">https://archive.org/details/characterofphysi0000feyn/page/12/mode/2up</a></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3216\"><span style=\"font-family: verdana;\">[10] \"2 Researchers Spell 'I.B.M.' 
Atom by Atom\", New York Times, April 5, 1990, <a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo \" href=\"https://timesmachine.nytimes.com/timesmachine/1990/04/05/356490.html?pageNumber=41\" target=\"_self\">https://timesmachine.nytimes.com/timesmachine/1990/04/05/356490.html?pageNumber=41</a></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3217\"><span style=\"font-family: verdana;\">[11] \"Qubit Bloch Sphere Visualization\", Casey Duckering,<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo \" href=\"https://raw.githubusercontent.com/cduck/bloch_sphere/master/examples/xyss_gate.gif\" target=\"_self\">https://raw.githubusercontent.com/cduck/bloch_sphere/master/examples/xyss_gate.gif</a></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3218\"><span style=\"font-family: verdana;\">[12] It's apple picking season here in New York:<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo \" href=\"https://www.applesfromny.com/varieties/\" target=\"_self\">https://www.applesfromny.com/varieties/</a></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3219\"><span style=\"font-family: verdana;\">[13] \"Driving quantum performance: more qubits, higher Quantum Volume, and now a proper measure of speed\",<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo \" href=\"https://www.ibm.com/quantum/blog/circuit-layer-operations-per-second\" target=\"_self\">https://www.ibm.com/quantum/blog/circuit-layer-operations-per-second</a></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3220\"><span style=\"font-family: verdana;\">[14] DARPA Quantum Benchmarking Initiative,<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo \" href=\"https://www.darpa.mil/work-with-us/quantum-benchmarking-initiative\" 
target=\"_self\">https://www.darpa.mil/work-with-us/quantum-benchmarking-initiative</a></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3221\"><span style=\"font-family: verdana;\">[15] \"The Computational Universe\", Seth Lloyd, 2002,<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo \" href=\"https://www.edge.org/conversation/seth_lloyd-the-computational-universe\" target=\"_self\">https://www.edge.org/conversation/seth_lloyd-the-computational-universe</a></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3222\"><span style=\"font-family: verdana;\">[16] \"Google Claims To Achieve Quantum Supremacy — IBM Pushes Back\",<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo \" href=\"https://www.npr.org/2019/10/23/772710977/google-claims-to-achieve-quantum-supremacy-ibm-pushes-back\" target=\"_self\">https://www.npr.org/2019/10/23/772710977/google-claims-to-achieve-quantum-supremacy-ibm-pushes-back</a></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3223\"><span style=\"font-family: verdana;\">[17] \"A route to scalable Majorana qubits\",<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo \" href=\"https://phys.org/news/2024-06-route-scalable-majorana-qubits.html\" target=\"_self\">https://phys.org/news/2024-06-route-scalable-majorana-qubits.html</a></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3224\"><span style=\"font-family: verdana;\">[18] \"DARPA's quantum computing is powered by ... FOMO\",<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo \" href=\"https://www.theregister.com/2023/02/02/darpa_quantum_microsoft/\" target=\"_self\">https://www.theregister.com/2023/02/02/darpa_quantum_microsoft/</a></span></p><div><br /></div>",
            "url": "https://hpc.social/personal-blog/2024/surfing-the-singularity-the-universe-computes/",
            
            
            
            
            
            "date_published": "2024-09-30T16:00:00-06:00",
            "date_modified": "2024-09-30T16:00:00-06:00",
            
                "author": "Surfing the Singularity"
            
        },
    
        {
            "id": "https://hpc.social/personal-blog/2024/the-hpc-cluster-as-a-reflection-of-values/",
            "title": "The HPC cluster as a reflection of values",
            "summary": null,
            "content_text": "Yesterday while I was cooking dinner, I happened to re-watch Bryan Cantrill&#8217;s talk on &#8220;Platform as a Reflection of Values&#8220;. (I watch a lot tech talks while cooking or baking &#8212; I often have trouble focusing on a video unless I&#8217;m doing something with my hands, but if I know a recipe well I can often make it on autopilot.)If you haven&#8217;t watched this talk before, I encourage checking it out. Cantrill gave it in part to talk about why the node.js community and Joyent didn&#8217;t work well together, but I thought he had some good insights into how values get built into a technical artifact itself, as well as how the community around those artifacts will prioritize certain values.While I was watching the talk (and chopping some vegetables), I started thinking about what values are most important in the &#8220;HPC cluster platform&#8221;.Technical valuesThis slide from the talk shows some examples of what Cantrill thinks of as platform values:A key point from the talk is that all of these are good things! Ideally you want to have all of these things when you build a new platform, whether that&#8217;s a programming language, a cloud platform, or whatever. But any given platform will choose to prioritize some set of values over others. You want them all, but when they come into tension, which ones will win?One example that Cantrill gives in the talk is the original Unix out of Bell Labs, which prioritized simplicity, composability, and portability. Certainly Unix wanted other features, like performance and maintainability, but if forced into a choice like performance vs simplicity, it would generally choose simplicity. 
Similarly, he talked about how JavaScript and node.js are built around values like approachability, expressiveness, and velocity, and how that contrasted with values like robustness and debuggability that Joyent valued as a cloud provider.The HPC cluster platformWhen I say &#8220;HPC cluster platform&#8221;, I&#8217;m loosely talking about the collection of hardware and software that is most often used to build high-performance computing clusters for workloads like scientific research or machine learning training.This generic platform consists of a large collection of identical compute nodes, orchestrated by a batch scheduler like Slurm or PBS, and with one or more &#8220;login nodes&#8221; serving as a front-end where users SSH in to prepare and run jobs on the cluster. For multi-node jobs and high-speed storage access, the compute nodes are connected by a very high-speed network, like 100Gb Ethernet or InfiniBand, which needs specific libraries to use effectively. Users on the cluster have access to command-line editors and development tools like compilers and scientific libraries, but mostly interact with the platform in a purely command line environment.See also, this really ugly Google Draw diagram:What values does this platform prioritize? In general, I tend to think that HPC platforms prioritize performance, portability, and approachability.Performance: This might seem obvious given the name &#8220;HPC&#8221;, but it&#8217;s worth thinking a little more about. When faced with a choice between performance and some other value, HPC clusters almost always choose performance. Performance is generally prioritized above cost, with most clusters using expensive compute and networking hardware. It&#8217;s prioritized over observability (&#8220;measurability&#8221; on Cantrill&#8217;s slide?), with most HPC clusters I&#8217;m aware of disabling most active monitoring features if they have a performance cost. 
It&#8217;s even prioritized above security, often turning off security features if they lead to lower performance or even measurable performance variability.Portability: Mindful of the difficulty in writing high-performance, correct scientific code, the HPC platform works reasonably hard to maintain portability to new hardware and software over time. A lot of this is due to a robust ecosystem of libraries and middleware. Most applications that scale across multiple nodes still use MPI; code doing linear algebra still depends on long-lived libraries like LAPACK and BLAS; and platform tools like the scheduler tend to be remarkably stable over time. New hardware features are often abstracted by middleware, especially at the networking level where support is built into your MPI library of choice.This story isn&#8217;t perfect &#8212; applications usually need recompilation on a new cluster, and still often need major changes to take advantage of new features. That&#8217;s why I chose &#8220;portability&#8221; instead of &#8220;compatibility&#8221;. But as a cluster admin, I&#8217;ve worked with many researchers who have maintained the same app on many different clusters for 10, 20, or even 30 years, which is a pretty impressive portability story.Approachability: This one may be controversial! The average HPC cluster can seem pretty arcane, especially for someone new to the platform. But I do think that HPC prioritizes a particular kind of approachability, which is that it is designed to onboard scientific researchers who are not themselves expert developers.A new user onboarding to a research HPC cluster frequently needs to understand three main tools:The Linux shell: Most HPC cluster environments are entirely command-line oriented (though Open OnDemand is helping change this!). 
You log in with SSH; edit using nano, vim, or emacs; and interact with the system entirely using a shell.The cluster scheduler: When you have your application ready to go, you submit your job to a queue using a cluster scheduler like Slurm and wait for it to complete. Cluster schedulers have a lot of moving parts and a user can often find endless knobs to tune, but it&#8217;s easy to get started with just a few commands. (And interestingly, almost all HPC cluster schedulers define their jobs as&#8230; shell scripts! You&#8217;re back to needing to know the shell. Annoying, sure, but at least it ain&#8217;t YAML!)Environment modules: This tool allows the cluster admins to provide a large library of libraries and tools, with specific versions, such that a cluster user just needs to type &#8220;module load openmpi/3&#8221;, while the tool munges the shell environment variables as needed to set up PATH, LD_LIBRARY_PATH, etc. just so.Now if this doesn&#8217;t sound like a robust software engineering environment&#8230; it isn&#8217;t! There are endless things that can go wrong, especially with environment modules interacting with the user&#8217;s own shell rc files and who knows what else. And there&#8217;s very little in this environment to encourage best practices like linting, pinned library versions, or even version control at all!But this environment is approachable&#8230; if you&#8217;re a graduate student in a field like physics or biology, running an existing application or writing your own simulation or data processing code, but who never got to take a class on software engineering, and where the code itself is not a first-class deliverable. The deliverable is the published paper.But what about all those other values?They&#8217;re still important! But the point of this exercise is to think about which values will &#8220;win&#8221; when they come into tension. 
And I do think that, if you look at HPC clusters in general, this is the set of values that will win.Availability is important, but not if that work costs us (much) performance. Velocity is great, but we&#8217;ll de-prioritize it in the name of workload portability. Security is essential &#8212; but we don&#8217;t want to make it harder to onboard new grad students&#8230;Your cluster is not the generic platform (and neither is mine)A last point I want to make is that there&#8217;s actually no such thing as the &#8220;generic HPC cluster platform&#8221;. Each individual cluster, at a university or company or government lab, is often configured in a unique way based on the hardware, performance goals, and whims of the person setting it up.Because of this, each individual HPC cluster may prioritize different values. A cluster at a national lab may choose security at the expense of approachability; or a different cluster may choose to sacrifice portability in the name of velocity if they&#8217;re developing on a new hardware or software system.(Also, the systems I build as part of my day job make very different choices than the &#8220;generic&#8221; cluster would. To a first approximation, I think I&#8217;d say we choose performance/debuggability/portability/security&#8230; but we also make different choices depending on what we&#8217;re building!)But I still think that performance, portability, and approachability represent the most common platform values I&#8217;ve seen in the HPC field as a whole. And I think the tools and practices we use bias towards those values.However&#8230; all of that is what I thought about while making dinner! If you think a different set of values makes more sense, feel free to send me an email and let me know. ",
            "content_html": "<p>Yesterday while I was cooking dinner, I happened to re-watch Bryan Cantrill&#8217;s talk on &#8220;<a href=\"https://www.youtube.com/watch?v=Xhx970_JKX4\">Platform as a Reflection of Values</a>&#8220;. (I watch a lot tech talks while cooking or baking &#8212; I often have trouble focusing on a video unless I&#8217;m doing something with my hands, but if I know a recipe well I can often make it on autopilot.)</p><p>If you haven&#8217;t watched this talk before, I encourage checking it out. Cantrill gave it in part to talk about why the node.js community and Joyent didn&#8217;t work well together, but I thought he had some good insights into how values get built into a technical artifact itself, as well as how the community around those artifacts will prioritize certain values.</p><p>While I was watching the talk (and chopping some vegetables), I started thinking about what values are most important in the &#8220;HPC cluster platform&#8221;.</p><p><span id=\"more-339\"></span></p><h2 class=\"wp-block-heading\">Technical values</h2><p>This slide from the talk shows some examples of what Cantrill thinks of as platform values:</p><figure class=\"wp-block-image size-full\"><img alt=\"A slide with the title &quot;Some platform values&quot;. The list includes approachability, availability, compatibility, composability, debuggability, expressiveness, extensibility, interoperability, integrity, maintainability, operability, performance, portability, resiliency, rigor, robustness, safety, security, simplicity, thoroughness, transparency, and velocity.\" class=\"wp-image-340\" height=\"538\" src=\"https://thinking.ajdecon.org/wp-content/uploads/2024/09/image.png\" width=\"969\" /></figure><p>A key point from the talk is that all of these are good things! Ideally you want to have <em>all</em> of these things when you build a new platform, whether that&#8217;s a programming language, a cloud platform, or whatever. 
But any given platform will choose to prioritize some set of values over others. You want them all, but when they come into tension, which ones will win?</p><p>One example that Cantrill gives in the talk is the original Unix out of Bell Labs, which prioritized simplicity, composability, and portability. Certainly Unix wanted other features, like performance and maintainability, but if forced into a choice like performance vs simplicity, it would generally choose simplicity. Similarly, he talked about how JavaScript and node.js are built around values like approachability, expressiveness, and velocity, and how that contrasted with values like robustness and debuggability that Joyent valued as a cloud provider.</p><h2 class=\"wp-block-heading\">The HPC cluster platform</h2><p>When I say &#8220;HPC cluster platform&#8221;, I&#8217;m loosely talking about the collection of hardware and software that is most often used to build high-performance computing clusters for workloads like scientific research or machine learning training.</p><p>This generic platform consists of a large collection of identical compute nodes, orchestrated by a batch scheduler like <a href=\"https://github.com/SchedMD/slurm\">Slurm</a> or <a href=\"https://github.com/openpbs/openpbs\">PBS</a>, and with one or more &#8220;login nodes&#8221; serving as a front-end where users SSH in to prepare and run jobs on the cluster. For multi-node jobs and high-speed storage access, the compute nodes are connected by a very high-speed network, like 100Gb Ethernet or InfiniBand, which needs specific libraries to use effectively. 
Users on the cluster have access to command-line editors and development tools like compilers and scientific libraries, but mostly interact with the platform in a purely command line environment.</p><p>See also, this really ugly Google Draw diagram:</p><figure class=\"wp-block-image size-large\"><img alt=\"A simple diagram showing a login node, a set of compute nodes, and network storage. The login node is connected to compute nodes by a management network. The storage is connected to compute nodes by a high-speed network.\" class=\"wp-image-348\" height=\"488\" src=\"https://thinking.ajdecon.org/wp-content/uploads/2024/09/image-1-1024x488.png\" width=\"1024\" /></figure><p>What values does this platform prioritize? In general, I tend to think that HPC platforms prioritize <em>performance</em>, <em>portability</em>, and <em>approachability</em>.</p><p><strong>Performance: </strong>This might seem obvious given the name &#8220;HPC&#8221;, but it&#8217;s worth thinking a little more about. When faced with a choice between performance and some other value, HPC clusters <em>almost always</em> choose performance. <br /><br />Performance is generally prioritized above cost, with most clusters using expensive compute and networking hardware. It&#8217;s prioritized over observability (&#8220;measurability&#8221; on Cantrill&#8217;s slide?), with most HPC clusters I&#8217;m aware of disabling most active monitoring features if they have a performance cost. It&#8217;s even prioritized above security, often turning off security features if they lead to lower performance or even measurable performance <em>variability</em>.</p><p><strong>Portability: </strong>Mindful of the difficulty in writing high-performance, correct scientific code, the HPC platform works reasonably hard to maintain portability to new hardware and software over time. </p><p>A lot of this is due to a robust ecosystem of libraries and middleware. 
Most applications that scale across multiple nodes still use <a href=\"https://en.wikipedia.org/wiki/Message_Passing_Interface\">MPI</a>; code doing linear algebra still depends on long-lived libraries like <a href=\"https://www.netlib.org/lapack/\">LAPACK</a> and <a href=\"https://www.netlib.org/blas/\">BLAS</a>; and platform tools like the scheduler tend to be remarkably stable over time. New hardware features are often abstracted by middleware, especially at the networking level where support is built into your MPI library of choice.</p><p>This story isn&#8217;t perfect &#8212; applications usually need recompilation on a new cluster, and still often need major changes to take advantage of new features. That&#8217;s why I chose &#8220;portability&#8221; instead of &#8220;compatibility&#8221;. But as a cluster admin, I&#8217;ve worked with many researchers who have maintained the same app on many different clusters for 10, 20, or even 30 years, which is a pretty impressive portability story.</p><p><strong>Approachability: </strong>This one may be controversial! The average HPC cluster can seem pretty arcane, especially for someone new to the platform. But I do think that HPC prioritizes a particular <em>kind</em> of approachability, which is that it is designed to onboard scientific researchers who are not themselves expert developers.</p><p>A new user onboarding to a research HPC cluster frequently needs to understand three main tools:</p><ul class=\"wp-block-list\"><li><strong>The Linux shell:</strong> Most HPC cluster environments are entirely command-line oriented (though <a href=\"https://openondemand.org/\">Open OnDemand</a> is helping change this!). You log in with SSH; edit using nano, vim, or emacs; and interact with the system entirely using a shell.</li><li><strong>The cluster scheduler: </strong>When you have your application ready to go, you submit your job to a queue using a cluster scheduler like Slurm and wait for it to complete. 
Cluster schedulers have a lot of moving parts and a user can often find endless knobs to tune, but it&#8217;s easy to get started with just a few commands. (And interestingly, almost all HPC cluster schedulers define their jobs as&#8230; shell scripts! You&#8217;re back to needing to know the shell. Annoying, sure, but at least it ain&#8217;t YAML!)</li><li><a href=\"https://modules.readthedocs.io/en/latest/\"><strong>Environment modules</strong></a>: This tool allows the cluster admins to provide a large library of libraries and tools, with specific versions, such that a cluster user just needs to type &#8220;module load openmpi/3&#8221;. The tool then munges the shell environment variables as needed to set up PATH, LD_LIBRARY_PATH, etc. just so.</li></ul><p>Now if this doesn&#8217;t sound like a robust software engineering environment&#8230; it isn&#8217;t! There are endless things that can go wrong, especially with environment modules interacting with the user&#8217;s own shell rc files and who knows what else. And there&#8217;s very little in this environment to encourage best practices like linting, pinned library versions, or even version control at all!</p><p>But this environment is <em>approachable</em>&#8230; if you&#8217;re a graduate student in a field like physics or biology, running an existing application or writing your own simulation or data processing code, who never got to take a class on software engineering, and for whom the code itself is not a first-class deliverable. The deliverable is the published paper.</p><h2 class=\"wp-block-heading\">But what about all those other values?</h2><p>They&#8217;re still important! But the point of this exercise is to think about which values will &#8220;win&#8221; when they come into tension. And I do think that, if you look at HPC clusters in general, this is the set of values that will win.</p><p>Availability is important, but not if that work costs us (much) performance. 
Velocity is great, but we&#8217;ll de-prioritize it in the name of workload portability. Security is essential &#8212; but we don&#8217;t want to make it harder to onboard new grad students&#8230;</p><h2 class=\"wp-block-heading\">Your cluster is not the generic platform (and neither is mine)</h2><p>A last point I want to make is that there&#8217;s actually <em>no such thing</em> as the &#8220;generic HPC cluster platform&#8221;. Each individual cluster, at a university or company or government lab, is often configured in a unique way based on the hardware, performance goals, and whims of the person setting it up.</p><p>Because of this, each <em>individual</em> HPC cluster may prioritize different values. A cluster at a national lab may choose security at the expense of approachability; or a different cluster may choose to sacrifice portability in the name of velocity if they&#8217;re developing on a new hardware or software system.</p><p>(Also, the systems I build as part of my day job make <em>very</em> different choices than the &#8220;generic&#8221; cluster would. To a first approximation, I think I&#8217;d say we choose performance/debuggability/portability/security&#8230; but we also make different choices depending on what we&#8217;re building!)</p><p>But I still think that <em>performance</em>, <em>portability</em>, and <em>approachability</em> represent the most common platform values I&#8217;ve seen in the HPC field as a whole. And I think the tools and practices we use bias towards those values.</p><p>However&#8230; all of that is what I thought about while making dinner! If you think a different set of values makes more sense, feel free to <a href=\"mailto:ajdecon@ajdecon.org\">send me an email</a> and let me know. <img alt=\"😉\" class=\"wp-smiley\" src=\"https://s.w.org/images/core/emoji/15.0.3/72x72/1f609.png\" style=\"height: 1em;\" /></p>",
            "url": "https://hpc.social/personal-blog/2024/the-hpc-cluster-as-a-reflection-of-values/",
            
            
            
            
            
            "date_published": "2024-09-29T22:22:51-06:00",
            "date_modified": "2024-09-29T22:22:51-06:00",
            
                "author": "Thinking Out Loud"
            
        },
    
        {
            "id": "https://hpc.social/personal-blog/2024/surfing-the-singularity-staying-relevant-in-a-time-of-rapid-change/",
            "title": "Surfing the Singularity - Staying Relevant in a Time of Rapid Change",
            "summary": null,
            "content_text": "The If you've been tracking the technology industry, and the software space in particular, for any amount of time you've witnessed the accelerating rate of technical change - it was always there, but now its become impossible to miss. The rate of technological change has seemed exponential for a while now, but recent advancements in AI have pushed this curve to new heights. An Accenture report released for Davos 2024 suggests that technical rate of change is seen by C-level leaders as the number one most impactful force on their business - more than financial or geopolitical matters - largely as a result of advances in various forms of AI tooling. [1] Of those surveyed, 88% see the rate of change increasing even further, and half say their organizations are not ready, even though 70% see it as a revenue opportunity. Dinosaur Developers? Today staying alive in business, especially the business of software engineering, means surfing increasingly turbulent and potentially disruptive waters. Consider the leaked recent remarks of Amazon Web Services CEO Matt Garman, wherein he suggested that a mere 2 years from now most AWS programmers wouldn't be coding. [2] In their Q2 investor call, Amazon cited 4,500 person-years of savings through the use of AI assistants on mostly mundane programming tasks like porting and hardening code with patterns of best practices. [3] While the International Monetary Fund suggests AI will impact 60% of jobs and increase wealth inequality, the jobs impacted are more likely to be skewed to higher income countries. [4] These remarks from influential leaders in the industry suggest that the impact of AI will be felt most acutely among software practitioners. Those of us who use integrated development environments (IDEs) to write code (and documents, like this one) with AI assist already are familiar with the benefits. For those unwilling to adapt, to retool and upscale their skills, the future might be bleak. 
Growing might mean zooming out from code to a more soup-to-nuts view of the software engineering process, especially specification and validation - the need to clearly state requirements and validate results without an immediate need to focus on implementation details. Notice that in the below diagram, taken from the Federal Aviation Administration which is increasingly interested in software engineering and model validation, traditional coding sits only at the bottom of the process rendered as a \"V\". [5] Development in the context of verification and validation, as seen by the FAA. So how to stay relevant in a rapidly changing world, to stay one step ahead of AI and the algorithmic job reaper? A recent LinkedIn survey of technologists suggests the number one thing a person can do is to learn new technologies. [6] A recent Gartner report [7] of the 30 most impactful technologies lists quantum computing as a weighty albeit distant critical enabler. Why? For starters, the existence of Shor's quantum-based numerical factoring algorithm means it's a matter of when, not if, quantum computers will be used to crack existing military-grade encryption. In the hands of an adversary, especially when unknown as with the Enigma machine in WWII, the results could be catastrophic, and this is a good part of what is fueling the current government interest in quantum computing.  Off to Quantum Summer School So for me, it was back to school. Summer school. First I hit the stacks, brushed my very stale self up on the fundamentals of the necessary calculus and linear algebra, the quantum mechanics to at least an undergraduate level of understanding, read several texts on the subject of quantum computing including the K&amp;R of quantum \"Mike &amp; Ike\", consumed mass quantities of videos from companies like IBM and Xanadu, and kicked the tires on their programming tool kits. 
Next I traveled a short distance from my home office to the Griffiss Institute in Rome NY at their now annual \"Quantum 4 International\" conference. This consisted of an impressive array of researchers and government administrators presenting their latest findings and laboratory results, often in a sort of national inventory of funded priority projects. The US Air Force, which maintains a research presence in Rome NY, is particularly interested in quantum computing and networking, for example, scaling up to a larger quantum computer by networking (entangling) a set of smaller ones. The Army and Navy are more focused on other non-computing aspects of quantum technology - sensing, magnetics, material defect identification, and as radio receivers. The Canadian delegation was focused on many of the same research topics, as well as a national emphasis on quantum technology education - to be impactful in quantum computing, one must be able to meld a variety of maths, physics, and programming skills with an unusual level of creativity to design novel and efficient algorithms which take advantage of the power of the quantum qubit - as a former college adjunct, this is no small educational challenge. Finally, researchers from the EU demonstrated new upper bounds on entanglement at a distance for wider area networking, and the use of novel estimation techniques to scale up quantum simulators in this \"NISQ\" era where real quantum computers are still small, noisy, fragile, and scarce. What was noticeably lacking was the demonstration of any current industrial utility for quantum computing applications, and the head of DARPA saw none emerging until we collectively move beyond the NISQ era. Pack a Remote Lunch While some industrial domains like chemistry will likely gain utility first, the head of DARPA suggests that utility in my own current application area - computational fluid dynamics (CFD) - will not emerge until we move into the \"cryptographically relevant\" era. 
It was with this in mind that I remotely attended the von Karman Institute for Fluid Dynamics in Belgium for a week-long course called \"Introduction to Quantum Computing in Fluid Dynamics\" funded by NATO. Entirely civilian in nature, the training was aimed at CFD researchers who might take advantage of one of the quantum facilities being installed at national laboratories in the US and EU, often collocated with their existing high performance computing (HPC) clusters. Not being a physicist, for me much of the class was consumed for general domain literacy, and the \"tl;dr\" is the re-emergence of particle-based methods like Lattice Boltzmann as a focus of research over finite volume methods and solving the Navier-Stokes equations, as is currently dominant in HPC-based CFD. With mind fully blown by the Von Karman experience, I next took two weeks and attended the IBM Global Quantum Summer School, 2024 edition, consisting of 10 lectures on a variety of topics and 4 labs. The videos are now posted on YouTube [8] and while I personally enjoyed the lecture on Hamiltonian simulation, there was a distinct and unfortunately NISQ-era necessity to focus on error correction and compensating for noise, and on the inner workings of the IBM Qiskit transpiler. In the latter case, because of the diverse nature of the emerging quantum hardware, because inter-qubit connectivity is often not N-way, and because at this stage things often break, it becomes common to mess with the compiler, and to adopt a toolchain with an eye to portability. Qiskit, a library and tool set for Python, is one of a couple frameworks (another being PennyLane) which currently meet this need, and the labs went to length to expose the student to the various topological mapping, translation, and optimization stages which are present in the quantum programming toolchain. 
And we got to play with a Hamiltonian simulation up to 50 qubits on real hardware, as most classical machines would have a hard time managing the simultaneous behavior of 50 spins. Next Up: AI Assistants &amp; Hybrid Quantum Computing During the Qiskit labs, naturally I was using LLM assist in my IDE, at minimum for tedious or repetitive tasks. But it was remarkable how often the AI assistant was helpful, even for a seemingly niche programming task such as using a quantum computing framework. I intend to delve into this topic more in a future blog and share my experiences with the various emerging AI tools for code and document assist, as well as in the broader end-to-end software engineering context. In addition, I intend to share future blog installments as my quantum education in search of industrial utility continues through the fall conference season. As a software engineer, I'll be particularly on the lookout for frameworks, including those which leverage AI, which allow the programmer to rise above the level of 1950s-like qubits and gates to higher and portable constructs. I'll also be sharing learnings on the rise of classical-quantum hybrids, especially in HPC contexts, as today's quantum approaches, such as variational algorithms which converge on solutions, require it. Here is another place where toolchains will play a major role, and where heterogeneous workflows which utilize AI tools will likely be impactful. Until next time, enjoy these last few weeks of summer. - andy References: 0. Photo by Ben Wicks on Unsplash 1. https://www.accenture.com/us-en/about/company/pulse-of-change 2. https://www.businessinsider.com/aws-ceo-developers-stop-coding-ai-takes-over-2024-8 3. https://accelerationeconomy.com/cloud-wars/amazon-genai-slashes-260-million-in-costs-saves-4500-years/ 4. https://nypost.com/2024/01/15/business/ai-will-affect-60-of-us-jobs-imf-warns/ 5. https://www.faa.gov/sites/faa.gov/files/2024-07/d_VVFlow_2024Mar21.jpg 6. 
https://www.linkedin.com/advice/0/how-can-you-stay-relevant-software-development-skills-it-services-kxeme 7. https://www.gartner.com/en/articles/30-emerging-technologies-that-will-guide-your-business-decisions 8. https://www.youtube.com/playlist?list=PLOFEBzvs-Vvr-GzDWlZpAcDpki5jUqYJu",
            "content_html": "<p class=\"ember-view reader-text-block__paragraph\" id=\"ember3428\">The If you've been tracking the technology industry, and the software space in particular, for any amount of time you've witnessed the<span class=\"white-space-pre\"> </span><span>accelerating</span><span class=\"white-space-pre\"> </span>rate of technical change - it was always there, but now its become impossible to miss.<span class=\"white-space-pre\"> </span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3428\">The rate of technological change has seemed exponential for a while now, but recent advancements in AI have pushed this curve to new heights.<span class=\"white-space-pre\"> </span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3429\">An Accenture report released for Davos 2024 suggests that technical rate of change is seen by C-level leaders as the number one most impactful force on their business - more than financial or geopolitical matters - largely as a result of advances in various forms of AI tooling. [1] Of those surveyed, 88% see the rate of change increasing even further, and half say their organizations are not ready, even though 70% see it as a revenue opportunity.<span class=\"white-space-pre\"> </span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3430\"><br /></p><h3 class=\"ember-view reader-text-block__heading-3\" id=\"ember3431\"><span style=\"font-size: x-large;\">Dinosaur Developers?<span class=\"white-space-pre\"> </span></span></h3><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3432\">Today staying alive in business, especially the business of software engineering, means surfing increasingly turbulent and potentially disruptive waters. Consider the leaked recent remarks of Amazon Web Services CEO Matt Garman, wherein he suggested that a mere 2 years from now most AWS programmers wouldn't be coding. 
[2] In their Q2 investor call, Amazon cited 4,500<span class=\"white-space-pre\"> </span><span>person-years</span><span class=\"white-space-pre\"> </span>of savings through the use of AI assistants on mostly mundane programming tasks like porting and hardening code with patterns of best practices. [3]<span class=\"white-space-pre\"> </span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3433\">While the International Monetary Fund suggests AI will impact 60% of jobs and increase wealth inequality, the jobs impacted are more likely to be skewed to higher income countries. [4] These remarks from influential leaders in the industry suggest that the impact of AI will be felt most acutely among software practitioners. Those of us who use integrated development environments (IDEs) to write code (and documents, like this one) with AI assist already are familiar with the benefits. For those unwilling to adapt, to retool and upscale their skills, the future might be bleak. Growing might mean zooming out from code to a more soup-to-nuts view of the software engineering process, especially specification and validation - the need to clearly state requirements and validate results without an immediate need to focus on implementation details. Notice that in the below diagram, taken from the Federal Aviation Administration which is increasingly interested in software engineering and model validation, traditional coding sits only at the bottom of the process rendered as a \"V\". 
[5]</p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3434\"><br /></p><div class=\"reader-image-block reader-image-block--full-width\"><figure class=\"reader-image-block__figure\"><div class=\"ivm-image-view-model   \"><div class=\"ivm-view-attr__img-wrapper                \"><img alt=\"\" class=\"ivm-view-attr__img--centered  reader-image-block__img evi-image lazy-image ember-view\" id=\"ember3435\" src=\"https://media.licdn.com/dms/image/v2/D4E12AQEOZqWZEle0-Q/article-inline_image-shrink_1500_2232/article-inline_image-shrink_1500_2232/0/1725982434333?e=1740009600&amp;v=beta&amp;t=IY_I5FqIuF9dAu5ikJgAtZat6UOQ67wPDShSYD2loeU\" /></div></div><figcaption class=\"reader-image-block__figure-image-caption display-block full-width text-body-small-open t-sans text-align-center t-black--light\">Development in the context of verification and validation, as seen by the FAA.</figcaption></figure></div><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3436\">So how to stay relevant in a rapidly changing world, to stay one step ahead of AI and the algorithmic job reaper? A recent LinkedIn survey of technologists suggests the number one thing a person can do is to learn new technologies. [6]</p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3437\">A recent Gartner report [7] of the 30 most impactful technologies lists quantum computing as a weighty albeit distant critical enabler. Why? For starters, the existence of Shor's quantum-based numerical factoring algorithm means it's a matter of when, not if, quantum computers will be used to crack existing military-grade encryption. 
In the hands of an adversary, especially when unknown as with the Enigma machine in WWII, the results could be catastrophic, and this is a good part of what is fueling the current government interest in quantum computing.<span class=\"white-space-pre\"> </span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3438\"><span class=\"white-space-pre\"> </span></p><h3 class=\"ember-view reader-text-block__heading-3\" id=\"ember3439\"><span style=\"font-size: x-large;\">Off to Quantum Summer School<span class=\"white-space-pre\"> </span></span></h3><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3440\">So for me, it was back to school. Summer school. First I hit the stacks, brushed my very stale self up on the fundamentals of the necessary calculus and linear algebra, the quantum mechanics to at least an undergraduate level of understanding, read several texts on the subject of quantum computing including the K&amp;R of quantum \"Mike &amp; Ike\", consumed mass quantities of videos from companies like IBM and Xanadu, and kicked the tires on their programming tool kits.<span class=\"white-space-pre\"> </span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3441\">Next I traveled a short distance from my home office to the<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo \" href=\"https://www.linkedin.com/company/griffiss-institute/\">Griffiss Institute</a><span class=\"white-space-pre\"> </span>in Rome NY at their now annual \"Quantum 4 International\" conference. This consisted of an impressive array of researchers and government administrators presenting their latest findings and laboratory results, often in a sort of national inventory of funded priority projects. 
The US Air Force, which maintains a research presence in Rome NY, is particularly interested in quantum computing and networking, for example, scaling up to a larger quantum computer by networking (entangling) a set of smaller ones. The Army and Navy are more focused on other non-computing aspects of quantum technology - sensing, magnetics, material defect identification, and as radio receivers. The Canadian delegation was focused on many of the same research topics, as well as a national emphasis on quantum technology education - to be impactful in quantum computing, one must be able to meld a variety of maths, physics, and programming skills with an unusual level of creativity to design novel and efficient algorithms which take advantage of the power of the quantum qubit - as a former college adjunct, this is no small educational challenge. Finally, researchers from the EU demonstrated new upper bounds on entanglement at a distance for wider area networking, and the use of novel estimation techniques to scale up quantum simulators in this \"NISQ\" era where real quantum computers are still small, noisy, fragile, and scarce. What was noticeably lacking was the demonstration of any current industrial utility for quantum computing applications, and the head of DARPA saw none emerging until we collectively move beyond the NISQ era.<span class=\"white-space-pre\"> </span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3442\"><br /></p><h3 class=\"ember-view reader-text-block__heading-3\" id=\"ember3443\"><span style=\"font-size: x-large;\">Pack a Remote Lunch<span class=\"white-space-pre\"> </span></span></h3><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3444\">While some industrial domains like chemistry will likely gain utility first, the head of DARPA suggests that utility in my own current application area - computational fluid dynamics (CFD) - will not emerge until we move into the \"cryptographically relevant\" era. 
It was with this in mind that I remotely attended the<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo \" href=\"https://www.linkedin.com/company/vki-vonkarmaninstitute/\">von Karman Institute for Fluid Dynamics</a><span class=\"white-space-pre\"> </span>in Belgium for a week-long course called \"Introduction to Quantum Computing in Fluid Dynamics\" funded by NATO. Entirely civilian in nature, the training was aimed at CFD researchers who might take advantage of one of the quantum facilities being installed at national laboratories in the US and EU, often collocated with their existing high performance computing (HPC) clusters. Not being a physicist, for me much of the class was consumed for general domain literacy, and the \"tl;dr\" is the re-emergence of particle-based methods like Lattice Boltzmann as a focus of research over finite volume methods and solving the Navier-Stokes equations, as is currently dominant in HPC-based CFD.<span class=\"white-space-pre\"> </span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3445\">With mind fully blown by the Von Karman experience, I next took two weeks and attended the IBM Global Quantum Summer School, 2024 edition, consisting of 10 lectures on a variety of topics and 4 labs. The videos are now posted on YouTube [8] and while I personally enjoyed the lecture on Hamiltonian simulation, there was a distinct and unfortunately NISQ-era necessity to focus on error correction and compensating for noise, and on the inner workings of the IBM<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo \" href=\"https://www.linkedin.com/company/qiskit/\">Qiskit</a><span class=\"white-space-pre\"> </span>transpiler. 
In the latter case, because of the diverse nature of the emerging quantum hardware, because inter-qubit connectivity is often not N-way, and because at this stage things often break, it becomes common to mess with the compiler, and to adopt a toolchain with an eye to portability. Qiskit, a library and tool set for Python, is one of a couple of frameworks (another being<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo \" href=\"https://www.linkedin.com/company/pennylaneai/\">PennyLane</a>) which currently meet this need, and the labs went to great lengths to expose the student to the various topological mapping, translation, and optimization stages which are present in the quantum programming toolchain. And we got to play with a Hamiltonian simulation up to 50 qubits on real hardware, as most classical machines would have a hard time managing the simultaneous behavior of 50 spins.</p><div class=\"reader-embed-block__iframe-embed\"></div><div class=\"separator\" style=\"clear: both; text-align: center;\"></div><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3446\"><br /></p><h3 class=\"ember-view reader-text-block__heading-3\" id=\"ember3447\"><span style=\"font-size: x-large;\">Next Up: AI Assistants &amp; Hybrid Quantum Computing<span class=\"white-space-pre\"> </span></span></h3><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3448\">During the Qiskit labs, naturally I was using LLM assist in my IDE, at minimum for tedious or repetitive tasks. But it was remarkable how often the AI assistant was helpful, even for a seemingly niche programming task such as using a quantum computing framework. 
I intend to delve into this topic more in a future blog and share my experiences with the various emerging AI tools for code and document assist, as well as in the broader end-to-end software engineering context.<span class=\"white-space-pre\"> </span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3449\">In addition, I intend to share future blog installments as my quantum education in search of industrial utility continues through the fall conference season. As a software engineer, I'll be particularly on the lookout for frameworks, including those which leverage AI, which allow the programmer to rise above the level of 1950s-like qubits and gates to higher and portable constructs. I'll also be sharing learnings on the rise of classical-quantum hybrids, especially in HPC contexts, as today's quantum approaches such as variational algorithms which converge on solutions require it. Here is another place where toolchains will play a major role, and where heterogeneous workflows which utilize AI tools will likely be impactful.<span class=\"white-space-pre\"> </span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3450\">Until next time, enjoy these last few weeks of summer.</p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3451\">- andy<span class=\"white-space-pre\"> </span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3452\"><br /></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3453\"><span style=\"font-size: x-large;\">References:<span class=\"white-space-pre\"> </span></span></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3454\">0. 
Photo by<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo \" href=\"https://unsplash.com/@profwicks?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash\" target=\"_self\">Ben Wicks</a><span class=\"white-space-pre\"> </span>on<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo \" href=\"https://unsplash.com/photos/green-and-blue-light-bokeh-Ia-qPL-HQdA?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash\" target=\"_self\">Unsplash</a></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3455\">1.<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo \" href=\"https://www.accenture.com/us-en/about/company/pulse-of-change\" target=\"_self\">https://www.accenture.com/us-en/about/company/pulse-of-change</a></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3456\">2.<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo \" href=\"https://www.businessinsider.com/aws-ceo-developers-stop-coding-ai-takes-over-2024-8\" target=\"_self\">https://www.businessinsider.com/aws-ceo-developers-stop-coding-ai-takes-over-2024-8</a></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3457\">3.<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo \" href=\"https://accelerationeconomy.com/cloud-wars/amazon-genai-slashes-260-million-in-costs-saves-4500-years/\" target=\"_self\">https://accelerationeconomy.com/cloud-wars/amazon-genai-slashes-260-million-in-costs-saves-4500-years/</a></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3458\">4.<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo \" href=\"https://nypost.com/2024/01/15/business/ai-will-affect-60-of-us-jobs-imf-warns/\" target=\"_self\">https://nypost.com/2024/01/15/business/ai-will-affect-60-of-us-jobs-imf-warns/</a></p><p class=\"ember-view 
reader-text-block__paragraph\" id=\"ember3459\">5.<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo \" href=\"https://www.faa.gov/sites/faa.gov/files/2024-07/d_VVFlow_2024Mar21.jpg\" target=\"_self\">https://www.faa.gov/sites/faa.gov/files/2024-07/d_VVFlow_2024Mar21.jpg</a></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3460\">6.<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo \" href=\"https://www.linkedin.com/advice/0/how-can-you-stay-relevant-software-development-skills-it-services-kxeme\" target=\"_self\">https://www.linkedin.com/advice/0/how-can-you-stay-relevant-software-development-skills-it-services-kxeme</a></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3461\">7.<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo \" href=\"https://www.gartner.com/en/articles/30-emerging-technologies-that-will-guide-your-business-decisions\" target=\"_self\">https://www.gartner.com/en/articles/30-emerging-technologies-that-will-guide-your-business-decisions</a></p><p class=\"ember-view reader-text-block__paragraph\" id=\"ember3462\">8.<span class=\"white-space-pre\"> </span><a class=\"bpCpipVrrjRHIWtfjEjtbNsDescTJyo \" href=\"https://www.youtube.com/playlist?list=PLOFEBzvs-Vvr-GzDWlZpAcDpki5jUqYJu\" target=\"_self\">https://www.youtube.com/playlist?list=PLOFEBzvs-Vvr-GzDWlZpAcDpki5jUqYJu</a></p>",
            "url": "https://hpc.social/personal-blog/2024/surfing-the-singularity-staying-relevant-in-a-time-of-rapid-change/",
            
            
            
            
            
            "date_published": "2024-09-10T16:00:00-06:00",
            "date_modified": "2024-09-10T16:00:00-06:00",
            
                "author": "Surfing the Singularity"
            
        },
    
        {
            "id": "https://hpc.social/personal-blog/2024/how-has-life-after-leaving-the-labs-been-going/",
            "title": "How has life after leaving the Labs been going?",
            "summary": null,
            "content_text": "June 2024 marked two years since I left my job at one of the world's most prestigious government HPC centers for a job in one of the world's largest technology corporations. In that time, the world of HPC has changed dramatically; just six months after I started, ChatGPT was released and triggered a gold rush in AI that is now overshadowing traditional scientific computing. This shift brought about massive HPC deployments led by hyperscalers, challenging the long-held belief that only national governments could deploy and operate world-leading supercomputers. My experiences at ISC'24 this past summer made clear to me that the traditional HPC community is now rethinking its role in the industry, and some individuals who built their careers in public HPC are revisiting their assumption that world-class HPC systems are limited to the public institutions that have historically dominated the top of the Top500 list. I had no idea things would unfold this way when I left my job at NERSC back in 2022, and I've been remarkably lucky to now be a part of the largest forces driving this huge shift in HPC. One of my new offices. Nicer than my old government office, and it has free food, but it's a ninety-minute drive each way. In the spirit of openness and helping others who are facing similar career decisions, I thought I would follow up on my Life and leaving NERSC post by sharing how my professional journey from DOE HPC into cloud HPC has been going. I'll first explain the path I've traveled over these past two years, then answer some of the most common questions I've been asked about this transition. As a forewarning, this is not a typical technology-focused post, and most of this might be obvious to people who already work in Big Tech. 
Here are the questions on which I reflected: What happened during my first two years in Corporate America? So what do I actually do? (Storage product management; HPC/AI development.) Am I happy with my decision and the new job? (Broadly, yes; but for a long time, no; finally, yes.) What does industry do better than the Labs? (Accountability; pace and decision making; relevance; technically: security.) But the pay is good, right? How's work-life balance? Do you miss anything about working at the lab? (Freedom to have an off day; travel; openness.) Would you still have left NERSC knowing what you know now? What happened during my first two years in Corporate America? I published my Life and leaving NERSC blog post on a Thursday, which was my last day working at NERSC. The following Monday was my first day at the new job, and being hired as 100% remote, it didn't feel that different; I was just booting up a Lenovo laptop (yuck) instead of a MacBook, using Teams and Outlook instead of Slack, GSuite, and Zoom, and that sort of thing. However, the job was undeniably different; whereas I used to be an engineer at NERSC, I was hired to be a \"Principal Product Manager\" within the cloud storage organization which was responsible for all object, disk, and file storage services. Although my title was \"product manager,\" I wasn't a people manager, and I didn't manage any specific storage products. Rather, my responsibility was to act as an HPC-focused overlay across all cloud storage services, and my job was to represent the interests of HPC users to all the people who did manage specific storage products. 
I didn't define product or feature roadmaps myself, but I could help those responsible for each product or service understand how to shape their roadmaps to benefit HPC workloads. I struggled in this position for a variety of reasons, so after I gave the new role an honest six to nine months, I decided that being a storage product manager just wasn't a good fit for me. Unfortunately, I reached this decision after the yield curve inverted and mass layoffs and hiring freezes were implemented, so there weren't a lot of places to go other than back to a government lab. Although I wasn't thriving as a storage product manager, I did have allies that helped me navigate my day-to-day struggles, and I decided to wait until more opportunities opened up and learn as much about product management as I could in the meantime. The yield curve inverted a month after I started my new job. Not great timing. After a little over a year as a storage product manager, a new engineering role opened up within a sister team in our HPC/AI infrastructure organization. After discussing the needs and nature of the work with the hiring manager, I applied for the job, went through the interview process, and was eventually given a verbal offer to join his team in June 2023. Unfortunately, the global economic outlook was still uncertain, and I wound up sitting in a holding pattern (as a storage product manager) from June 2023 to November 2023. It wasn't until the week of SC'23 that I finally got the written offer letter, and I spent December wrapping up loose ends within the storage organization. On January 2, 2024, I began my new (and current) role within the company. The move was completely lateral, but I changed job titles from \"Product Manager\" to \"Software Engineer,\" and I changed organizations from storage to specialized compute. I say all this because my experiences in making the professional transition from government HPC to cloud HPC are colored by the fact that I really changed jobs twice. 
I've had both product management and engineering/development roles, and I've been in both storage and HPC organizations. So what do I actually do? I've had two very different roles within the same orbit of HPC/AI infrastructure, so I'll describe them separately to give you a sense of the breadth of HPC roles possible. Storage product management. As a storage product manager (PM), I was an inch deep but a mile wide on every storage service, every commercial HPC workload, and all the ways in which those two could touch each other. I'd guess that only 25% of my day-to-day work required deep expertise in HPC; the remainder was either business-centric or required only understanding HPC in broad strokes. This was quite unlike the things I'd done earlier in my career in the public sector, since there's not an equivalent to what a product manager does within the DOE Labs. For example, I spent a lot of my time as a storage PM explaining the basics of HPC I/O to different teams within the company. When most cloud people think \"storage,\" they are really thinking about either enterprise storage (things like virtual disks for virtual machines) or content distribution (think serving up content for web apps). The concept of hundreds or thousands of VMs all writing to the same place at the same time is standard practice in the HPC world, but in the cloud world, this is a DDoS attack. Since my organization was responsible for all storage, not just HPC storage, there were a lot of people who simply never had to think about the challenges that HPC people take for granted, and it could be challenging (as the new guy) to convince seasoned cloud storage PMs that some workloads legitimately need hundreds of gigabytes per second of bandwidth. As a PM, I also wound up doing a fair amount of business reporting. For example, object storage is used by all manner of cloud customers, so prioritizing features that specifically help HPC customers required understanding how many HPC customers actually used it. 
How do you define whether a workload is really an HPC workload or not? In DOE, we'd waste hours debating stuff like this for no real purpose, but when I became a product manager, I had to define this to make the business case that we needed to develop a certain feature that would only be used by HPC workloads. Finally, I did a fair amount of actual product and project management work. Get on the phone with a customer, write down what they do, and turn those into requirements. Do that a bunch of times, then synthesize a more general requirements document. Review it with leadership. Get approval to assign developers to work on the features to meet those requirements. Ask other teams to develop features you need for your feature. Negotiate with everyone on development priorities in the next six months. Track progress of the development team. Produce demos to show that progress is being made. Present progress to leadership. That sort of thing. It's similar to being a PI on a research grant, except I had customers, dependencies, and ultimate accountability. As far as technical work, a lot of it revolved around meeting customers and internal partner teams where they were in terms of their knowledge of HPC. I did a fair amount of technical marketing; I would come up with the ways people should think about combining storage services together in their HPC workflows, then figure out how to communicate that to audiences with vastly different levels of technical understanding. For example, I didn't own our Lustre product, object storage product, or HPC CPU node product, but I owned the story around how we envisioned all three services working well together. This meant I would create slides and narratives around this, then present them to anyone from our sales teams (who often had limited HPC-specific experience) to the world's leading HPC centers. I also sometimes helped development teams accurately test their storage systems against HPC workloads. 
For example, when ChatGPT exploded, everyone wanted to know how well their storage service worked for training large language models. I would talk to the engineers who trained LLMs, infer what their I/O patterns would be based on their description of how they did training, then design a benchmark that our developers could follow to emulate that LLM training workflow. Since I understood both the workload and the storage technology, it was often faster for me to translate between AI engineers and storage engineers rather than have them speak directly. HPC/AI development. As an HPC/AI engineer, my work is a lot more technical and focused. I'm on a \"white-glove support team\" that works directly with large, strategic customers in HPC and AI, so rather than working with dozens of customers and connecting them to dozens of storage technologies, I work with one or two customers and the specific technologies on which they build their HPC or AI clusters. Because of this, I'd wager 95% of my day-to-day work is technical. I don't spend much time in a terminal by virtue of my relative seniority. Instead, I sit in on a lot of internal meetings and represent the perspective of our strategic HPC and AI customers. For example, if we are trying to decide which CPU to include in our next HPC-optimized CPU node, I might work with our benchmarking engineers to develop a representative benchmark and then interpret the results with the node's product managers. 
I'm not the person running the benchmark myself; instead, I might ask hard questions that the customer might ask, help decide the next experiments to run, and backstop our engineers if the customer starts poking too many holes in the work. I also function as a system architect at times; if a customer shows up with unusually large or complex HPC system requirements, I'll help translate the customer requirement (e.g., \"We need 10 TB/s of storage bandwidth\") for individual product teams (e.g., \"they will be using N compute nodes and accessing storage via a network with this topology and tapering, likely running an application that has this pattern, ...\"). This often requires understanding what the compute, network, and storage product teams are doing and being able to explain it all in whatever terms each team understands. I also wind up sitting in on customer meetings and asking critical questions so that we can make informed design tradeoffs. I do write code, but no more than I did when I was a system architect at NERSC. For example, I might pull PDU telemetry from across a data center to help determine if oversubscribing the power for a future cluster would impact workloads. The code itself is pretty straightforward statistical analysis, but interpreting it requires an understanding of a bunch of things ranging from the workload running on the nodes to how nodes are distributed across PDUs, racks, rows, halls, and buildings. The remaining 5% of my work is not very technical and involves things I opt into because it's interesting or the right thing to do. 
This might be spending time providing historical context for a business strategy document or showing up at a meeting to help explain the customer perspective to a finance or sales team. Am I happy with my decision and the new job? Yes, no, and yes. Broadly, yes. I am glad I made the decision to leave NERSC and take on a job in Big Tech for a couple of high-level reasons. As a product manager, I learned a lot about how businesses and corporations work to a degree that I never did when I worked at a startup and I never would have if I stayed with the government. Not only do I now know what the difference between gross and operating margin is, but I get it because I've had to build COGS and pricing models that could sustain and grow a new product. I know exactly how to price cloud services (or any product or service, really) and where that money goes. I now pay much more attention to quarterly earnings reports, and I have a more confident opinion on what different elements of these reports say about a technology company's trajectory. This has equipped me with what feels like a much more complete understanding of the HPC industry overall. I'm also glad to work at a company that generally tries to do the right things. It is investing heavily towards being carbon negative (rather than just buying carbon offsets) while others are burning gas inefficiently in a race to be #1. It also matches every donation I make to 501(c)(3) nonprofits, which is a huge benefit that matches up with the ways in which I try to share my good fortune with others. And it beats employees over the heads with a strong, positive corporate culture which holds managers and leaders accountable for the wellness of their employees. 
These sorts of things don't meaningfully exist in government, and there are a lot of big corporations out there that prioritize short-term profits over the longer-term benefits that come from investing in sustainability and philanthropy. But for a long time, no. However, I was unhappy for my first eighteen months. I took a gamble on storage product management being as interesting and fulfilling as engineering when I decided to step into this new job, and I lost that bet. I quickly came to realize that there's a big difference between being a storage person in an HPC organization and being an HPC person in a storage organization. When I worked in an HPC organization like NERSC, I was used to being the odd man out because parallel storage is a complicated topic that most HPC folks don't really understand. Despite that, everyone is still generally like-minded and appreciates the same things; everyone knows what MPI and InfiniBand are, and everybody knows what a checkpoint and restart might look like. Conversely, when I worked in a storage organization, I was an odd man out because nobody really understood HPC. The average engineer only had a vague notion of what MPI or InfiniBand accomplished. If you don't understand that MPI is what lets hundreds of servers all work on the same distributed problem at once, it's easy to forget that an MPI application will also cause hundreds of servers to all write data at once. And if you've never used an MPI barrier, it's hard to internalize the fact that the whole application stops until the slowest process finishes writing. Instead of worrying about tightly coupled applications, I realized that storage people worry about data availability and durability above all else. After all, storage's #1 job is to not lose data. In contrast, it's not unusual for an HPC user to have hundreds of terabytes of data vanish because they forgot to copy it off of scratch before it got purged. 
This sharp difference in priorities--data durability versus performance--causes friction, because at the end of the day, what's good for HPC (high bandwidth and low latency) is usually bad for storage (high durability and availability). The landscape of storage for HPC and storage for enterprises as I see it. If you care about one but work with people who care about the other, expect friction. These are technological differences, but they result in a persistent, elevated level of latent stress that never goes away. People tend to worry about the things they understand, and people tend to ask for help about the things that worry them. What this meant for me is that I spent a lot of time focusing on things that everyone understood (like market trends, revenue, and general indicators of performance) instead of hard problems unique to large-scale HPC. And because I was never solving the hard problems, I never got the gratification of feeling like I accomplished something that, as I learned, is an important motivator to me. To be clear, I realize that I made the decision to focus on problems that other people brought me rather than carve out a space for me to work on the problems I felt were important. I'm sure that someone who was more tenacious and unafraid to pursue challenges that nobody else understood would have a very different experience as a PM. But after about a year, I realized that what I value and enjoy doing just isn't aligned with what a storage PM needs to be successful. I realized I didn't want to keep doing what I was doing for another five years, so I decided to stop. Finally, yes. I quite enjoy my role in HPC/AI engineering and development now, as it's similar to what I used to do in the DOE. 
I have to learn about how different hardware, software, and systems work, and I have a lot of room to focus on challenges that play to my strengths and interests. For example, I love engaging with the HPC community, and my job still allows me to go out to the big HPC conferences to do that. At the same time, I also like getting into the guts of system behavior, and I still get to spend at least an hour or two a week doing something quantitative. My day-to-day is also steeped in that familiar feel of working in an HPC organization. Every cluster has a name that gets bandied about in meetings, and they have the same familiar challenges--fabric disruptions, firmware upgrades, flaky nodes, and the like. The standard responsibilities are also all there; some teams perform system administration, others support users, and some of us focus on future system designs. But the cluster names aren't nearly as creative as those in the public sector (Eagle's real name sounds like a serial number). And they look pretty boring too; there are no fancy rack graphics. Five racks of a cloud GPU cluster that runs ND H100 v5-series VMs. There are also teams that have no analogue in the traditional HPC world, like those who are responsible for things ranging from the smart NICs and software-defined networks to profits and losses. This is what keeps things interesting; I can just as easily spend an hour reviewing benchmark results from the latest GPU with my teammates as I can learning how the control systems for liquid heat exchangers affect system reliability or data center safety. When things are quiet and no fires are burning, going to work can sometimes feel like going to a big playground full of HPC and HPC-adjacent technology. Don't get me wrong; it's still a job, and there are still unpleasant tasks and uncomfortable situations. 
Working at a cloud provider means a lot of processes are designed to be slow and steady, and some teams struggle to understand why anyone would want to reboot every node in a cluster at once--such an event would be a massive outage in general-purpose cloud! But working in an HPC organization means that when these situations arise, I'm no longer the odd HPC guy--I'm on the odd HPC team. What does industry do better than the Labs? Accountability. Organizational planning happens twice a year, and this planning is the time when teams all get on the same page about what work to prioritize in the next six months (a semester). Teams coordinate dependent work with each other, trade horses on what the priority of each request is, and at the end of planning, have committed agreements about what work will be done in the next semester. The progress on that work is tracked throughout the semester, delays and interruptions are accounted for, and there's an escalation path up through the ranks of management and leadership if priorities cannot be agreed upon by individual teams. The DOE Labs operate much more loosely in my experience. There, people tend to work on whatever pet projects they want until they lose interest. If a project is funded by a research grant, there are loose deliverables and timelines (write X papers per year), but at the end of the day, nothing really bad happens if the work progresses slowly or its quality is poor. There's no penalty if a research grant results in a piece of software that nobody uses or a paper that nobody reads. The value of the work is largely intellectual, and as a result, it's perfectly possible to have a long career at a DOE lab, churning out papers and software, that lacks any lasting impact. Tying money to the value of work can make accountability much more black and white. 
If you pay a team of engineers a million dollars a year to develop a new service that only increases revenue by a million dollars a year, that service is going to be scrutinized every time prioritization happens. Is there a way to increase its revenue through better features or better positioning? It'll be a product manager's job to go figure that out. If the answer comes back as \"no,\" then that service might be put on a shelf and its engineering team reassigned to work on something that has a greater impact. Those engineers don't get to decide to keep working on the service that has limited demonstrable value. At the same time, managers are accountable for the wellbeing of their team and the teams underneath them. All employees fill out regular, semi-anonymized surveys on different aspects of job satisfaction, and the results of these surveys roll up all the way to the top of the company. If employees are disgruntled, their managers know it, and those managers' managers know it, and everyone up the chain is accountable for improving those scores. Sometimes that results in increased hiring so engineers don't feel overworked. Other times it means reorganizing people and teams to align them with the work they are good at performing. And if nothing works and a team's morale keeps declining, maybe it's because of the manager--and the manager gets replaced. Pace and decision making. Because managers and leaders are accountable, I've also found them to be much more empowered to just do what they feel is the right thing to do. Whereas no big decision in the DOE Labs can be made without reviews, panels, strategic offsites, more reviews, and presentations to headquarters--all of which could add months or years to a project--the direction can move on a dime because all it takes is one executive to sign off and accept full responsibility for the consequences of their decision. 
Getting the approval to staff up and pursue a good idea often requires only winning over one or two key people, not an army of feds in Germantown or an anonymous review panel that isn't conversant in what you're proposing. And again, sometimes money makes decisions much easier to make. For example, a few people at ISC'24 asked me why we didn't re-do the Top500 run for Eagle to beat Aurora since the SC'23 scoring was so close. The decision process can be as simple as this: According to the Top500 list's raw data, Eagle achieved 561,200 TFlop/s using an Nmax of 11,796,480. Knowing that HPL's walltime is (flop count / Rmax) and HPL's flop count is (2/3 * Nmax^3), you can calculate that the HPL walltime for this run was 1,950 seconds or 0.542 hours. The public list price for an Eagle node (ND96isr H100 v5) is something like $60 an hour. The HPL run used 1,800 such nodes. Given the above, during the half hour it would take to run HPL, those same nodes could be running a production workload which would have resulted in $58,000 in revenue. That is, the opportunity cost of re-running HPL is at least $58,000 in lost revenue. In reality, it would take time to boot up and configure the cluster of virtual machines and do a few scale-up runs which would tie up the nodes for a couple hours, making this opportunity cost closer to a couple hundred thousand dollars. Is getting a marginally higher Top500 score worth a couple hundred thousand dollars if your machine is already listed and had its day in the sun? I don't need an executive to answer that question. But in the public HPC space, who's to say what the opportunity cost is? If HPL wasn't running twice a year on Frontier, are the dozen or so lattice QCD jobs that would be running instead worth a couple hundred thousand dollars? Relevance. I might be more vain than I thought when I worked for the government, because I really enjoy being able to talk about the work that I do with the general public now. 
When people ask, \"What work do you do?\" and I respond with, \"Have you ever heard of Copilot or ChatGPT?\" there is almost always a conversation that follows. People may not really understand how artificial intelligence and large language models work, but they've played with those technologies and have opinions and questions. Sometimes the conversation is about big-picture stuff like \"will AI take over the world?\" At other times it's specific like \"what do you think about AI's effect on global climate change?\" Because I am steeped in all aspects of AI in my day-to-day work, I can usually speak intelligently about any dimension of the AI industry when my neighbors ask. Every blog post these days needs at least one AI-generated picture, so here is a picture generated by DALL-E that \"captures the essence of explaining AI concepts to neighbors in a friendly, approachable setting.\" But more poignantly, my team directly supports the supercomputers that trained the model that generates these pictures. This was a much bigger challenge when I worked in the public sector. When I told people that I worked at Lawrence Berkeley National Lab, nobody knew what I was talking about half of the time. The other half of the time, people would think I worked on nuclear weapons because Lawrence Livermore National Lab has a confusingly similar name and geography. And if the conversation ever got as far as what people did on the supercomputers I supported, it would rapidly tail off once all parties (including me) realized that cosmological hydrodynamics and quantum Monte Carlo don't really make for great conversation since they don't touch people's everyday lives. This isn't to say that the work done at the Labs isn't important. But the general public doesn't understand it, and to a large degree, doesn't really care about it. 
I realize that being able to impress your neighbors with what you do isn't at the top of the list of most people's job requirements, but I get a lot of satisfaction out of it. Technically: security HPC doesn't really worry about cybersecurity. Every HPC center has a security group and does scans and threat modeling, but at the end of the day, the security practices on all the largest supercomputers in the public sector are roughly the same as they were twenty years ago. Users ssh into a login node, and once you're inside, you have access to everything. You can see everyone else who's logged in, you can see everyone who chmodded their home directory to be +777, and the only thing separating you from everyone else is the Linux kernel. Passwordless ssh is everywhere, and oftentimes, passwordless ssh for the root user is everywhere. This does not fly with paying commercial HPC and AI customers in the cloud who use supercomputing to develop better products faster than their competitors. For example, both Arm and AMD have publicly stated that they perform a lot of their silicon design simulations using HPC in the cloud. What would happen if both AMD and Arm used the same cluster and one accidentally made their project directory world-readable? Should domain scientists' understanding of how POSIX file permissions work really be the last line of defense against a next-generation CPU or GPU's specs being leaked to the competition? I had to quickly learn about modern security practices when I started doing HPC in the commercial cloud out of necessity. I'm still nowhere close to being a security expert, but two years has been long enough for me to now cringe when I talk to my colleagues in the traditional HPC community about how they protect against threats. 
It's not really their fault that most of the HPC community hasn't adopted modern practices, because the tools and practices required to do it right aren't easy to set up, automate, and maintain from scratch. For example, basic LDAP is a short path to allowing users to log into a cluster's nodes, but if those users also need to authenticate themselves to REST services that support an HPC workflow across multiple clusters, you have to start building a Rube Goldberg machine of software on top of LDAP. Similarly, sticking every user on their own overlay network is great for limiting the blast radius of a compromised account. However, automating the configuration of VXLAN tunnel endpoints as nodes get allocated to and deallocated from jobs requires a lot of fancy orchestration that is either very complicated to build and maintain yourself or very expensive to buy and maintain. As a result, HPC just accepts the risk. Cloud has figured all this out though, and the price of providing this security infrastructure is included in the cost of cloud-based supercomputers. But the pay is good, right? Like I said before I left the public sector, my base salary is comparable to what I got at the lab. It's actually gotten less competitive because all salaries were frozen when I was first eligible for a raise. So, after considering the effects of inflation, my paycheck is a little lower than what it was in the government two years ago. What's different is the bonus structure which simply does not exist in the government or university world. For those who aren't familiar with how bonuses work in the tech industry, I'll share how it works for me: In the first year, I was awarded two signing bonuses: one in cash, one in stock. Half of the cash bonus was paid out up-front, and the other half was paid out after I had been there a year. 
The stock grant could not be touched during the first year because it had a one-year \"cliff.\" On my one-year anniversary, I got the second half of my cash signing bonus, and my signing stock grant began \"vesting.\" After a year, I was also eligible for an annual performance-based raise, cash bonus, and stock bonus. Because of the economy, my annual raise was zero. The cash bonus was paid out in a lump sum, similar to my cash signing bonus. The stock bonus was awarded all at once but follows a multi-year \"vesting schedule\" which means I am only actually given fractions of the total award over time. However, these bonuses don't have a \"cliff\" and begin vesting immediately. Every year thereafter, I am eligible for an annual raise, cash bonus, and another stock bonus. The way stock bonuses work was the least intuitive part to me, but since it's such a significant part of total compensation, it's worth spelling out for anyone who's considering an offer that includes this: Stock bonuses are defined in terms of dollar values. For example, let's say I got a signing stock bonus of $1,000 with a one-year cliff that vests quarterly (every three months) over five years. On the day that stock bonus is awarded, my employer converts that $1,000 value into company stock based on the market value that day. If stocks are $50 per share, I am awarded 20 shares. My employer hangs on to those shares on my behalf, so I can't actually do anything with them yet. Since I have a five-year vesting schedule and the award vests quarterly, my shares will vest twenty times (four quarters, five years). Coincidentally, since I have 20 shares, I will get one share per quarter. However, because I have a one-year cliff, I get all four quarters of my first year at my one-year anniversary. So, four shares should appear in my brokerage account on my one-year anniversary. 
Once a share is in my brokerage account, I can do whatever I want with it (like sell it immediately!) Every quarter thereafter, one more share vests and appears in my brokerage account. Assuming I get a stock bonus as part of my overall annual bonus, this means that stock awards pile up and vest every year. This is tricky for two reasons: Although my initial stock award was $1,000 in the above example, that amount was converted to stock the day it was awarded. Assuming I am doing a good job and increasing the value of my employer's stock, the value of those shares will increase while they're vesting. This means by the time the first four shares of my award vested at my one-year anniversary, they were worth more than the $50 per share they represented when they were awarded. More broadly, the value of a stock bonus tends to increase over time, making the true cash value of a $1,000 stock bonus a lot more than $1,000 by the time it completely vests. Every year's stock award comes with its own multi-year vesting period, which means at any given time, I have multiple years' bonuses all vesting at once. This also means that at any given time, I have a bunch of unvested stock that's worth a lot of money that I can't yet spend. If I quit my job though, all these unvested shares vanish into thin air. These two factors make up the golden handcuffs that people often talk about in industry. The longer I stick around, the more unvested stock I have hanging over my head, and it usually becomes increasingly valuable (yet inaccessible!) over time. 
The reality is that if you've put in a few years in Big Tech, you might have years' worth of base salary tied up in unvested stock that all goes away if you quit. The end result is that although base salary is competitive with what you can make in a government HPC facility, there's a significant cash bonus that falls out of the sky once a year, and an appreciable amount of stock appears in your brokerage account every couple of months which you can turn around and sell for more cash. Depending on seniority and performance, these bonuses can add up to a significant fraction of base salary. Finally, the above is consistent with what I've seen firsthand at two companies in Big Tech but may be different based on the role and the company. For example, field-facing roles in sales and support may be completely different beasts, and private companies and startups load things differently due to the value of equity. How's work-life balance? It hasn't been different from working in the government. Just like at a lab or university, some people work around the clock while others stick pretty close to the standard workday. There may be a higher concentration of Type A personalities who put in a lot of time in Big Tech, and this may pressure others to keep up and also put in long hours, but there's rarely been an occasion where a manager expects staff to routinely work nights and weekends. Doing so would probably result in negative employee satisfaction scores which would roll up and eventually have to be addressed. Of course, there are cases where working odd hours is required to get the job done. Because I work for a global organization, I've had to get up early to meet with teams or customers in Europe. I've also had to stay up late to meet with teams or customers in Asia. And on some particularly annoying days, I've had to do both and wind up working from 5am to 8pm. But I never felt that I had no choice in the matter; I pulled these hours because it was the right thing to do at the time. 
And I don't see this as being too different from the days when I'd work sixteen-hour days, seven days a week, for the entire month of March to put together a paper for SC. Or days when I'm at SC and am preparing talks, meeting with partners, and otherwise hustling from 8am to 1am for five days straight. One big difference is the fact that my employer offers discretionary time off (\"unlimited vacation\"). This is a divisive topic in industry, but I see it as a positive for work-life balance because it underscores an emphasis on outcomes rather than output. I can take an afternoon off or enjoy a long weekend with little fanfare, because productivity is infinitely more valuable than presence. As long as I do what needs to get done, I don't have to worry about timing vacations to ensure I am banking enough time off in between. Do you miss anything about working at the lab? Absolutely. There are a bunch of appealing things about working in a DOE lab (or an NSF center) that I've had to give up since coming to industry. Freedom to have an off day Right before I finished graduate school, I had a conversation about life at the Labs with Professor Edmund Webb, soon after he became a professor following a decade-long career at Sandia National Labs. He said that, after becoming a professor, he lost the ability to just close the door to his office and focus on something he needed to get done for a day. I didn't really grasp what this meant at the time, but I totally get it now. The DOE might be one of the few places where you can take a day--maybe even a week--and just close your door to everything else that's going on around you to focus on what you want to do. In the case of professorship, there are always students requiring attention; in industry, it's customers and partners. I think this difference results from two factors: very few things in public HPC are very urgent, and the Labs are stocked full of independent, free-thinking Ph.D. types. 
There's rarely a penalty if something is late by a day (or two years! Remember when Aurora was called \"A21?\"), but there can be a huge payoff in prestige if one of your wacky side projects turns out to be something useful (this is how Shifter came to be). By comparison, working at a giant corporation often means there are a bunch of interdependencies on others, and the odds of any one of your 200,000 coworkers sending you a Teams message asking for help are just a lot higher than they are at a 70-person supercomputer center. The culture is much more team-oriented, and being a one-person army isn't incentivized as much. Travel Part of my job within the DOE complex was to go around the country (and the world) and be smart, and secondarily, show that my lab hired smart people and did smart things. If headquarters wanted to make sure that the supercomputer they were about to spend $500M on was technically sound, I'd sometimes get invited to go sit in on a review and try to poke holes in the design. If a European HPC project wanted to ensure they were including a global perspective on some dimension of future HPC strategy, I'd sometimes get invited to give a talk about how I view the world of data. And if these reviews and workshops happened to be in awesome places around the world--oh well! I feel a lot more self-conscious about requesting approval to attend these sorts of boondoggles as an engineer now because the first question I have to answer is, \"Is this trip business critical?\" If there's a direct line of sight between me giving a talk at a workshop and a specific business strategy, I can say \"yes\" with a straight face. But it's hard to accept an invitation to fly off to Switzerland to give a 30-minute talk when I know that my attendance isn't going to move any needles. Openness Just like it's no longer my job to travel the world and just be smart, it's not my job to write about the work that I (or my team) do. 
I miss writing papers and giving technical talks, because the process of putting together coherent thoughts around a technical topic is one of the ways I really come to understand it. There are also a lot of really wild ideas that we're pursuing at scale that the scientific computing community has never considered, but there are two factors that work against being open about these things: In terms of prioritization, my time is always better spent solving problems, or at least documenting them for internal audiences who fully grasp the context around them, than writing about them in a way that the rest of the world can understand. It's hard to justify the time to write a retrospective or a study unless there's a strategic advantage behind it. The customers I support typically do not want the world knowing what they're doing. There is an AI arms race happening right now, and having the technical sophistication to utilize massive-scale supercomputers effectively is a competitive advantage. In the traditional HPC community, only national security work involves a comparable level of secrecy, and none of the intelligence agencies are openly contributing to the state of the art in HPC either. So instead of making conference papers and presentations, these days I make more internal papers and presentations. I'm trying to figure out ways to publish interesting technical anecdotes on my website (for example, I maintain a collection of LLM training requirements as I am exposed to them), but it's a lot of extra work to disentangle the proprietary bits from my work notes to do this. Related to openness is also freedom to speak my mind in public forums. I had the most latitude to blast my opinions out onto the Internet when I was still early in my career and nobody listened to me, but I've had to get progressively less opinionated over the years. 
At this point, I abide by a written corporate social media policy which, although very reasonable in what it requests (don't slander competitors, always be transparent about who employs you), stops me from commenting on news as much as I used to since so many tech companies qualify as competitors in some dimension. Would you still have left knowing what you know now? Yes. I still stand by just about everything I wrote in my original blog post; at the time, I just needed a change, and I found the change that I was looking for. Without immersing myself in the world of cloud, I would never have learned about virtualization, physical infrastructure, or modern security to the degree that I have. And the fact that I stumbled into what has become one of the leading companies in AI at the dawn of generative AI was an extremely lucky coincidence. However, this doesn't mean that I now turn my nose up at doing HPC in the public sector. There are many unique aspects to working at a DOE lab or NSF center that have no parallel in industry. I also believe that I am the sum of the experiences that led me to where I work today, and I would never have gotten the opportunity to write this retrospective if I hadn't learned everything I did working in the DOE and NSF. And perhaps above all else, there is something attractive about public service that I haven't been able to shake in the last two years. I still dial in to ASCAC meetings to see what the world of public HPC and scientific computing is thinking and doing, and I still try to contribute time and attention to working groups like NITRD's MAGIC. I write lengthy blog posts in a futile attempt to caution the leaders in public-sector HPC against rejecting AI workloads in commercial clouds as HPC. And every time I learn some slick way we deal with hard technological or sociological issues at work, I still file it away in the \"good ideas for when I go back\" folder in the back of my mind. I don't have any near-term plans to go anywhere though. 
Like I said before, there are still plenty of days when dialing into work is like going to the playground. Amazing things are happening in the world of HPC infrastructure at scale now that the world is pouring money into AI, and the rate of scale and innovation is no longer constrained to 40 MW and $500M per supercomputer like it was when public-sector HPC was setting the bar for leadership. There is a whole new exciting world of challenges and possibilities when you start thinking about building supercomputers that consume hundreds of megawatts of power. Like I wrote two years ago, I don't think any government has the appetite to build data centers for scientific computing that are larger than today's 50 MW exascale facilities. This means that government HPC centers will never have a reason to explore the exciting world of 100+ MW supercomputers or work on the wacky problems that arise at that scale. Consequently, the biggest and most challenging problems in HPC--at least in terms of infrastructure and systems design at scale--are becoming unique to industry, not public HPC. I got into HPC because I enjoy working on large, complex systems. Considering where I am at this stage of my life, what I want to accomplish in the rest of my career, and what gets me out of bed in the morning, I feel like I wound up in the right place for now. I have no regrets.",
            "content_html": "<p>June 2024 marked two years since I <a href=\"http://blog.glennklockwood.com/2022/05/life-and-leaving-nersc.html\">left my job at one of the world's most prestigious government HPC centers for a job in one of the world's largest technology corporations</a>. In that time, the world of HPC has changed dramatically; just six months after I started, ChatGPT was released and triggered a gold rush in AI that is now overshadowing traditional scientific computing. This shift brought about massive HPC deployments led by hyperscalers, challenging the long-held belief that only national governments could deploy and operate world-leading supercomputers. <a href=\"http://blog.glennklockwood.com/2024/05/isc24-recap.html\">My experiences at ISC'24 this past summer</a> made clear to me that the traditional HPC community is now rethinking its role in the industry, and some individuals who built their careers in public HPC are revisiting their assumption that world-class HPC systems are limited to the public institutions that have historically dominated the <a href=\"https://www.top500.org/lists/top500/list/2024/06/\">top of the Top500 list</a>. I had no idea things would unfold this way when I left my job at NERSC back in 2022, and I've been remarkably lucky to now be a part of one of the largest forces driving this huge shift in HPC.</p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure style=\"display: inline-block; margin-left: 1em; margin-right: 1em;\"><figcaption style=\"font-size: 14px; margin-top: 5px;\">One of my new offices. 
Nicer than my old government office, and it has free food, but it's a ninety-minute drive each way.</figcaption></figure></div><p>In the spirit of openness and helping others who are facing similar career decisions, I thought I would follow up on my <a href=\"http://blog.glennklockwood.com/2022/05/life-and-leaving-nersc.html\">Life and leaving NERSC</a> post by sharing how my professional journey from DOE HPC into cloud HPC has been going. I'll first explain the path I've traveled over these past two years, then answer some of the most common questions I've been asked about this transition.</p><p>As a forewarning, this is not a typical technology-focused post, and most of this might be obvious to people who already work in Big Tech. Here are the questions on which I reflected:</p><ol><li><a href=\"http://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#first-two-years\">What happened during my first two years in Corporate America?</a></li><li><a href=\"http://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#what-i-do\">So what do I actually do?</a><ol><li><a href=\"http://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#storage-product-management\">Storage product management</a></li><li><a href=\"http://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#hpc-ai-development\">HPC/AI development</a></li></ol></li><li><a href=\"http://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#am-i-happy\">Am I happy with my decision and the new job?</a><ol><li><a href=\"http://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#am-i-happy-broadly-yes\">Broadly, yes</a></li><li><a href=\"http://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#long-time-no\">But for a long time, no</a></li><li><a href=\"http://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#finally-yes\">Finally, yes</a></li></ol></li><li><a href=\"http://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#industry-does-better\">What does 
industry do better than the Labs?</a><ol><li><a href=\"http://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#accountability\">Accountability</a></li><li><a href=\"http://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#pace-and-decision-making\">Pace and decision making</a></li><li><a href=\"http://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#relevance\">Relevance</a></li><li><a href=\"http://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#technically-security\">Technically: security</a></li><li><a href=\"http://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#pay-good\">But the pay is good, right?</a></li><li><a href=\"http://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#work-life-balance\">How's work-life balance?</a></li></ol></li><li><a href=\"http://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#what-i-miss\">Do you miss anything about working at the lab?</a><ol><li><a href=\"http://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#freedom-to-have-an-off-day\">Freedom to have an off day</a></li><li><a href=\"http://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#travel\">Travel</a></li><li><a href=\"http://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#openness\">Openness</a></li></ol></li><li><a href=\"http://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#regret-decision\">Would you still have left NERSC knowing what you know now?</a></li></ol><h2 id=\"first-two-years\">What happened during my first two years in Corporate America?</h2><p>I published my <a href=\"http://blog.glennklockwood.com/2022/05/life-and-leaving-nersc.html\">Life and leaving NERSC</a> blog post on a Thursday, which was my last day working at NERSC. 
The following Monday was my first day at the new job, and being hired as 100% remote, it didn't feel that different; I was just booting up a Lenovo laptop (yuck) instead of a MacBook, using Teams and Outlook instead of Slack, GSuite, and Zoom, and that sort of thing.</p><p>However, the job was undeniably different; whereas I used to be an engineer at NERSC, I was hired to be a \"Principal Product Manager\" within the cloud storage organization which was responsible for all object, disk, and file storage services. Although my title was \"product manager,\" I wasn't a people manager, and I didn't manage any specific storage products. Rather, my responsibility was to act as an HPC-focused overlay across all cloud storage services, and my job was to represent the interests of HPC users to all the people who did manage specific storage products. I didn't define product or feature roadmaps myself, but I could help those responsible for each product or service understand how to shape their roadmaps to benefit HPC workloads.</p><p>I struggled in this position for a variety of reasons, so after I gave the new role an honest six to nine months, I decided that being a storage product manager just wasn't a good fit for me. Unfortunately, I reached this decision after the <a href=\"https://www.nytimes.com/2022/07/21/business/yield-curve-inversion.html\">yield curve inverted</a> and <a href=\"https://www.theverge.com/2023/1/18/23560315/microsoft-job-cuts-layoffs-2023-tech\">mass-layoffs and hiring freezes</a> were implemented, so there weren't a lot of places to go other than back to a government lab. Although I wasn't thriving as a storage product manager, I did have allies that helped me navigate my day-to-day struggles, and I decided to wait until more opportunities opened up and learn as much about product management as I could in the meantime.</p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure style=\"display: inline-block; margin-left: 1em; margin-right: 
1em;\"><figcaption style=\"font-size: 14px; margin-top: 5px;\">The yield curve inverted a month after I started my new job. Not great timing.</figcaption></figure></div><p>After a little over a year as a storage product manager, a new engineering role opened up within a sister team in our HPC/AI infrastructure organization. After discussing the needs and nature of the work with the hiring manager, I applied for the job, went through the interview process, and was eventually given a verbal offer to join his team in June 2023. Unfortunately, the global economic outlook was still uncertain, and I wound up sitting in a holding pattern (as a storage product manager) from June 2023 to November 2023. It wasn't until the week of SC'23 that I finally got the written offer letter, and I spent December wrapping up loose ends within the storage organization.</p><p>On January 2, 2024, I began my new (and current) role within the company. The move was completely lateral, but I changed job titles from \"Product Manager\" to \"Software Engineer,\" and I changed organizations from storage to specialized compute.</p><p>I say all this because my experiences in making the professional transition from government HPC to cloud HPC are colored by the fact that I really changed jobs twice. I've had both product management and engineering/development roles, and I've been in both storage and HPC organizations.</p><h2 id=\"what-i-do\">So what do I actually do?</h2><p>I've had two very different roles within the same orbit of HPC/AI infrastructure, so I'll describe them separately to give you a sense of the breadth of HPC roles possible.</p><h3 id=\"storage-product-management\">Storage product management</h3><p>As a <b>storage product manager</b> (PM), I was an inch deep but a mile wide on every storage service, every commercial HPC workload, and all the ways in which those two could touch each other. 
I'd guess that only 25% of my day-to-day work required deep expertise in HPC; the remainder was either business-centric or required only understanding HPC in broad strokes. This was quite unlike the things I'd done earlier in my career in the public sector, since there's not an equivalent to what a product manager does within the DOE Labs.</p><p>For example, I spent a lot of my time as a storage PM explaining the basics of HPC I/O to different teams within the company. When most cloud people think \"storage,\" they are really thinking about either enterprise storage (things like virtual disks for virtual machines) or content distribution (think serving up content for web apps). The concept of hundreds or thousands of VMs all writing to the same place at the same time is standard practice in the HPC world, but in the cloud world, this is a <a href=\"https://www.microsoft.com/en-us/security/business/security-101/what-is-a-ddos-attack?msockid=2008901357a56c4518b3840856e96dad\">DDoS attack</a>. Since my organization was responsible for all storage, not just HPC storage, there were a lot of people who simply never had to think about the challenges that HPC people take for granted, and it could be challenging (as the new guy) to convince seasoned cloud storage PMs that some workloads legitimately need hundreds of gigabytes per second of bandwidth.</p><p>As a PM, I also wound up doing a fair amount of business reporting. For example, object storage is used by all manner of cloud customers, so prioritizing features that specifically help HPC customers required understanding how many HPC customers actually used it. How do you define whether a workload is really an HPC workload or not? 
In DOE, we'd waste hours debating stuff like this for no real purpose, but when I became a product manager, I had to define this to make the business case that we needed to develop a certain feature that would only be used by HPC workloads.</p><p>Finally, I did a fair amount of actual product and project management work. Get on the phone with a customer, write down what they do, and turn that into requirements. Do that a bunch of times, then synthesize a more general requirements document. Review it with leadership. Get approval to assign developers to work on the features to meet those requirements. Ask other teams to develop features you need for your feature. Negotiate with everyone on development priorities in the next six months. Track progress of the development team. Produce demos to show that progress is being made. Present progress to leadership. That sort of thing. It's similar to being a PI on a research grant, except I had customers, dependencies, and ultimate accountability.</p><p>As far as technical work goes, a lot of it revolved around meeting customers and internal partner teams where they were in terms of their knowledge of HPC. I did a fair amount of technical marketing; I would come up with the ways people should think about combining storage services together in their HPC workflows, then figure out how to communicate that to audiences with vastly different levels of technical understanding. 
For example, I didn't own our <a href=\"https://learn.microsoft.com/en-us/azure/azure-managed-lustre/amlfs-overview\">Lustre product</a>, <a href=\"https://learn.microsoft.com/en-us/azure/storage/blobs/storage-blobs-introduction\">object storage product</a>, or <a href=\"https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/high-performance-compute/hb-family\">HPC CPU node product</a>, but I owned <a href=\"https://techcommunity.microsoft.com/t5/azure-high-performance-computing/azure-managed-lustre-not-your-grandparents-parallel-file-system/ba-p/3889946\">the story around how we envisioned all three services working well together</a>. This meant I would create slides and narratives around this, then present them to anyone from our sales teams (who often had limited HPC-specific experience) to the world's leading HPC centers.</p><p>I also sometimes helped development teams accurately test their storage systems against HPC workloads. For example, when ChatGPT exploded, everyone wanted to know how well their storage service worked for training large language models. I would talk to the engineers who trained LLMs, infer what their I/O patterns would be based on their description of how they did training, then design a benchmark that our developers could follow to emulate that LLM training workflow. Since I understood both the workload and the storage technology, it was often faster for me to translate between AI engineers and storage engineers rather than have them speak directly.</p><h3 id=\"hpc-ai-development\">HPC/AI development</h3><p>As an <b>HPC/AI engineer</b>, my work is a lot more technical and focused. I'm on a \"white-glove support team\" that works directly with large, strategic customers in HPC and AI, so rather than working with dozens of customers and connecting them to dozens of storage technologies, I work with one or two customers and the specific technologies on which they build their HPC or AI clusters. 
Because of this, I'd wager 95% of my day-to-day work is technical.</p><p>I don't spend much time in a terminal by virtue of my relative seniority. Instead, I sit in on a lot of internal meetings and represent the perspective of our strategic HPC and AI customers. For example, if we are trying to decide which CPU to include in our next HPC-optimized CPU node, I might work with our benchmarking engineers to develop a representative benchmark and then interpret the results with the node's product managers. I'm not the person running the benchmark myself; instead, I might ask hard questions that the customer might ask, help decide the next experiments to run, and backstop our engineers if the customer starts poking too many holes in the work.</p><p>I also function as a system architect at times; if a customer shows up with unusually large or complex HPC system requirements, I'll help translate the customer requirement (e.g., \"We need 10 TB/s of storage bandwidth\") for individual product teams (e.g., \"they will be using N compute nodes and accessing storage via a network with this topology and tapering, likely running an application that has this pattern, ...\"). This often requires understanding what the compute, network, <i>and</i> storage product teams are doing and being able to explain it all in whatever terms each team understands. I also wind up sitting in on customer meetings and asking critical questions so that we can make informed design tradeoffs.</p><p>I do write code, but no more than I did when I was a system architect at NERSC. For example, I might pull PDU telemetry from across a data center to help determine if oversubscribing the power for a future cluster would impact workloads. 
The code itself is pretty straightforward statistical analysis, but interpreting it requires an understanding of a bunch of things ranging from the workload running on the nodes to how nodes are distributed across PDUs, racks, rows, halls, and buildings.</p><p>The remaining 5% of my work is not very technical and involves things I opt into because it's interesting or the right thing to do. This might be spending time providing historical context for a business strategy document or showing up at a meeting to help explain the customer perspective to a finance or sales team.</p><h2 id=\"am-i-happy\">Am I happy with my decision and the new job?</h2><p>Yes, no, and yes.</p><h3 id=\"am-i-happy-broadly-yes\">Broadly, yes</h3><p>I am glad I made the decision to leave NERSC and take on a job in Big Tech for a couple of high-level reasons.</p><p>As a product manager, I learned a lot about how businesses and corporations work to a degree that I never did when I worked at a startup and I never would have if I stayed with the government. Not only do I now know what the difference between gross and operating margin is, but I <i>get</i> it because I've had to build <a href=\"https://www.investopedia.com/terms/c/cogs.asp\">COGS</a> and pricing models that could sustain and grow a new product. I know exactly how to price cloud services (or any product or service, really) and where that money goes. I now pay much more attention to quarterly earnings reports, and I have a more confident opinion on what different elements of these reports say about a technology company's trajectory. This has equipped me with what feels like a much more complete understanding of the HPC industry overall.</p><p>I'm also glad to work at a company that generally tries to do the right things. 
It is investing heavily towards being <a href=\"https://blogs.microsoft.com/blog/2020/01/16/microsoft-will-be-carbon-negative-by-2030/\">carbon negative</a> (rather than just buying carbon offsets) while others are <a href=\"https://www.tomshardware.com/tech-industry/artificial-intelligence/elon-musks-new-worlds-fastest-ai-data-center-is-powered-by-massive-portable-power-generators-to-sidestep-electricity-supply-constraints\">burning gas inefficiently</a> in a race to be #1. It also <a href=\"https://givebutter.com/blog/companies-that-match-donations\">matches every donation I make to 501(c)(3) nonprofits</a>, which is a huge benefit that matches up with the ways in which I try to share my good fortune with others. And it beats employees over the heads with a strong, positive <a href=\"https://careers.microsoft.com/v2/global/en/culture\">corporate culture</a> which holds managers and leaders accountable for the wellness of their employees. These sorts of things don't meaningfully exist in government, and there are a lot of big corporations out there that prioritize short-term profits over the longer-term benefits that come from investing in sustainability and philanthropy.</p><h3 id=\"long-time-no\">But for a long time, no</h3><p>However, I was unhappy for my first eighteen months.</p><p>I took a gamble on storage product management being as interesting and fulfilling as engineering when I decided to step into this new job, and I lost that bet. I quickly came to realize that there's a big difference between being a <u>storage person in an HPC organization</u> and being an <u>HPC person in a storage organization</u>.</p><p>When I worked in an HPC organization like NERSC, I was used to being the odd man out because parallel storage is a complicated topic that most HPC folks don't <i>really</i> understand. 
Despite that, everyone is still generally like-minded and appreciates the same things; everyone knows what MPI and InfiniBand are, and everybody knows what a checkpoint and restart might look like.</p><p>Conversely, when I worked in a storage organization, I was an odd man out because nobody really understood HPC. The average engineer only had a vague notion of what MPI or InfiniBand accomplished. If you don't understand that MPI is what lets hundreds of servers all work on the same distributed problem at once, it's easy to forget that an MPI application will also cause hundreds of servers to all write data at once. And if you've never used an MPI barrier, it's hard to internalize the fact that the whole application stops until the slowest process finishes writing.</p><p>Instead of worrying about tightly coupled applications, I realized that storage people worry about data availability and durability above all else. After all, storage's #1 job is to not lose data. In contrast, it's not unusual for an HPC user to have hundreds of terabytes of data vanish because they forgot to copy it off of scratch before it got purged. This sharp difference in priorities--data durability versus performance--causes friction, because at the end of the day, what's good for HPC (high bandwidth and low latency) is usually bad for storage (high durability and availability).</p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure style=\"display: inline-block; margin-left: 1em; margin-right: 1em;\"><figcaption style=\"font-size: 14px; margin-top: 5px;\">The landscape of storage for HPC and storage for enterprises as I see it. If you care about one but work with people who care about the other, expect friction.</figcaption></figure></div><p>These are technological differences, but they result in a persistent, elevated level of latent stress that never goes away. 
People tend to worry about the things they understand, and people tend to ask for help about the things that worry them. What this meant for me is that I spent a lot of time focusing on things that everyone understood (like market trends, revenue, and general indicators of performance) instead of hard problems unique to large-scale HPC. And because I was never solving the hard problems, I never got the gratification of feeling like I had accomplished something, which, as I learned, is an important motivator for me.</p><p>To be clear, I realize that I made the decision to focus on problems that other people brought me rather than carve out a space for myself to work on the problems I felt were important. I'm sure that someone who was more tenacious and unafraid to pursue challenges that nobody else understood would have a very different experience as a PM. But after about a year, I realized that what I value and enjoy doing just isn't aligned with what a storage PM needs to be successful. I realized I didn't want to keep doing what I was doing for another five years, so I decided to stop.</p><h3 id=\"finally-yes\">Finally, yes</h3><p>I quite enjoy my role in HPC/AI engineering and development now, as it's similar to what I used to do in the DOE. I have to learn about how different hardware, software, and systems work, and I have a lot of room to focus on challenges that play to my strengths and interests. For example, I love engaging with the HPC community, and my job still allows me to go out to the big HPC conferences to do that. At the same time, I also like getting into the guts of system behavior, and I still get to spend at least an hour or two a week doing something quantitative.</p><p>My day-to-day is also steeped in that familiar feel of working in an HPC organization. Every cluster has a name that gets bandied about in meetings, and they have the same familiar challenges--fabric disruptions, firmware upgrades, flaky nodes, and the like. 
The standard responsibilities are also all there; some teams perform system administration, others support users, and some of us focus on future system designs. But the cluster names aren't nearly as creative as those in the public sector (<a href=\"https://www.top500.org/system/180236/\">Eagle's</a> real name sounds like a serial number). And they look pretty boring too; there are no fancy rack graphics.</p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure style=\"display: inline-block; margin-left: 1em; margin-right: 1em;\"><figcaption style=\"font-size: 14px; margin-top: 5px;\">Five racks of a cloud GPU cluster that runs ND H100 v5-series VMs. <a href=\"https://www.youtube.com/watch?v=ntKZ5CibuIQ\">Source</a></figcaption></figure></div><p>There are also teams that have no analogue in the traditional HPC world, like those who are responsible for things ranging from the <a href=\"https://techcommunity.microsoft.com/t5/azure-infrastructure-blog/announcing-the-general-availability-of-azure-boost/ba-p/3981384\">smart NICs</a> and software-defined networks to profits and losses. This is what keeps things interesting; I can just as easily spend an hour reviewing benchmark results from the latest GPU with my teammates as I can learning how the control systems for <a href=\"https://news.microsoft.com/source/features/ai/in-house-chips-silicon-to-service-to-meet-ai-demand/\">liquid heat exchangers</a> affect system reliability or <a href=\"https://www.osha.gov/noise/standards\">data center safety</a>. When things are quiet and no fires are burning, going to work can sometimes feel like going to a big playground full of HPC and HPC-adjacent technology.</p><p>Don't get me wrong; it's still a job, and there are still unpleasant tasks and uncomfortable situations. 
Working at a cloud provider means a lot of processes are designed to be slow and steady, and some teams struggle to understand why anyone would want to reboot every node in a cluster at once--such an event would be a massive outage in general-purpose cloud! But working in an HPC organization means that when these situations arise, I'm no longer the <i>odd HPC guy</i>--I'm on the <i>odd HPC team</i>.</p><h2 id=\"industry-does-better\">What does industry do better than the Labs?</h2><h3 id=\"accountability\">Accountability</h3><p>Organizational planning happens <a href=\"https://devblogs.microsoft.com/azure-sdk/planning-2021/\">twice a year</a>, and this planning is the time when teams all get on the same page about what work to prioritize in the next six months (a <i>semester</i>). Teams coordinate dependent work with each other, trade horses on what the priority of each request is, and at the end of planning, have committed agreements about what work will be done in the next semester. The progress on that work is tracked throughout the semester, delays and interrupts are accounted for, and there's an escalation path up through the ranks of management and leadership if priorities cannot be agreed upon by individual teams.</p><p>The DOE Labs operate much more loosely in my experience. There, people tend to work on whatever pet projects they want until they lose interest. If a project is funded by a research grant, there are loose deliverables and timelines (write X papers per year), but at the end of the day, nothing really bad happens if the work progresses slowly or its quality is poor. There's no penalty if a research grant results in a piece of software that nobody uses or a paper that nobody reads. The value of the work is largely intellectual, and as a result, it's perfectly possible to have a long career at a DOE lab, churning out papers and software, that lacks any lasting impact.</p><p>Tying money to the value of work can make accountability much more black and white. 
If you pay a team of engineers a million dollars a year to develop a new service that only increases revenue by a million dollars a year, that service is going to be scrutinized every time prioritization happens. Is there a way to increase its revenue through better features or better positioning? It'll be a product manager's job to go figure that out. If the answer comes back as \"no,\" then that service might be put on a shelf and its engineering team reassigned to work on something that has a greater impact. Those engineers don't get to decide that they want to keep working on the service that has limited demonstrable value.</p><p>At the same time, managers are accountable for the wellbeing of their team and the teams underneath them. All employees fill out regular, semi-anonymized surveys on different aspects of job satisfaction, and the results of these surveys roll up all the way to the top of the company. If employees are disgruntled, their managers know it, and those managers' managers know it, and everyone up the chain is accountable for improving those scores. Sometimes that results in increased hiring so engineers don't feel overworked. Other times it means reorganizing people and teams to align them with the work they are good at performing. And if nothing works and a team's morale keeps declining, maybe it's because of the manager--and the manager gets replaced.</p><h3 id=\"pace-and-decision-making\">Pace and decision making</h3><p>Because managers and leaders are accountable, I've also found them to be much more empowered to just do what they feel is the right thing to do. Whereas no big decision in the DOE Labs can be made without reviews, panels, strategic offsites, more reviews, and presentations to headquarters--all of which could add months or years to a project--the direction in industry can turn on a dime because all it takes is one executive to sign off and accept full responsibility for the consequences of their decision. 
Getting the approval to staff up and pursue a good idea often requires only winning over one or two key people, not an army of feds in Germantown or an anonymous review panel that isn't conversant in what you're proposing.</p><p>And again, sometimes money makes decisions much easier to make. For example, a few people at ISC'24 asked me why we didn't re-do the <a href=\"https://www.top500.org/system/180236/\">Top500 run for Eagle</a> to beat Aurora since the SC'23 scoring was so close. The decision process can be as simple as this:</p><p></p><ul><li>According to the <a href=\"https://top500.org/lists/top500/2024/06/download/TOP500_202406.xlsx\">Top500 list's raw data</a>, Eagle achieved 561,200 TFlop/s using an Nmax of 11,796,480.</li><li>Knowing that HPL's walltime is (flop count / Rmax) and HPL's flop count is (2/3 * Nmax^3), you can calculate that the HPL walltime for this run was 1,950 seconds or 0.542 hours.</li><li>The <a href=\"https://azure.microsoft.com/en-us/pricing/calculator/\">public list price</a> for an Eagle node (ND96isr H100 v5) is something like $60 an hour.</li><li>The HPL run used 1,800 such nodes.</li></ul><p>Given the above, during the half hour it would take to run HPL, those same nodes could be running a production workload which would have resulted in $58,000 in revenue. That is, the <i>opportunity cost</i> of re-running HPL is at least $58,000 in lost revenue. In reality, it would take time to boot up and configure the cluster of virtual machines and do a few scale-up runs which would tie up the nodes for a couple hours, making this opportunity cost closer to a couple hundred thousand dollars.</p><p>Is getting a marginally higher Top500 score worth a couple hundred thousand dollars if your machine is already listed and had its day in the sun? I don't need an executive to answer that question. But in the public HPC space, who's to say what the opportunity cost is? 
If HPL wasn't running twice a year on Frontier, are the dozen or so lattice QCD jobs that would be running instead worth a couple hundred thousand dollars?</p><p></p><h3 id=\"relevance\">Relevance</h3><p>I might be more vain than I thought when I worked for the government, because I really enjoy being able to talk about the work that I do with the general public now. When people ask, \"What work do you do?\" and I respond with, \"Have you ever heard of Copilot or ChatGPT?\" there is almost always a conversation that follows. People may not really understand how artificial intelligence and large language models work, but they've played with those technologies and have opinions and questions. Sometimes the conversation is about big-picture stuff like \"will AI take over the world?\" At other times it's specific like \"what do you think about AI's effect on global climate change?\" Because I am steeped in all aspects of AI in my day-to-day work, I can usually speak intelligently about any dimension of the AI industry when my neighbors ask.</p><div class=\"separator\" style=\"clear: both; text-align: center;\"><figure style=\"display: inline-block; margin-left: 1em; margin-right: 1em;\"><figcaption style=\"font-size: 14px; margin-top: 5px;\">Every blog post these days needs at least one AI-generated picture, so here is a picture generated by DALL-E that \"captures the essence of explaining AI concepts to neighbors in a friendly, approachable setting.\" But more poignantly, my team directly supports the supercomputers that trained the model that generates these pictures.</figcaption></figure></div><p>This was a much bigger challenge when I worked in the public sector. When I told people that I worked at Lawrence Berkeley National Lab, nobody knew what I was talking about half of the time. The other half of the time, people would think I worked on nuclear weapons because Lawrence Livermore National Lab has a confusingly similar name and geography. 
And if the conversation ever got as far as what people did on the supercomputers I supported, it would rapidly tail off once all parties (including me) realized that cosmological hydrodynamics and quantum Monte Carlo don't really make for great conversation since they don't touch people's everyday lives.</p><p>This isn't to say that the work done at the Labs isn't important. But the general public doesn't understand it, and to a large degree, doesn't really care about it. I realize that being able to impress your neighbors with what you do isn't at the top of the list of most people's job requirements, but I get a lot of satisfaction out of it.</p><h3 id=\"technically-security\">Technically: security</h3><p>HPC doesn't really worry about cybersecurity. Every HPC center has a security group and does scans and threat modeling, but at the end of the day, the security practices on all the largest supercomputers in the public sector are roughly the same as they were twenty years ago. Users ssh into a login node, and once you're inside, you have access to everything. You can see everyone else who's logged in, you can see everyone who chmodded their home directory to be +777, and the only thing separating you from everyone else is the Linux kernel. Passwordless ssh is everywhere, and oftentimes, passwordless ssh for the root user is everywhere.</p><p>This does not fly with paying commercial HPC and AI customers in the cloud who use supercomputing to develop better products faster than their competitors. For example, both <a href=\"https://www.synopsys.com/blogs/chip-design/eda-in-the-cloud-snug-2023.html\">Arm and AMD have publicly stated that they perform a lot of their silicon design simulations using HPC in the cloud</a>. What would happen if both AMD and Arm used the same cluster and one accidentally made their project directory world-readable? 
Should domain scientists' understanding of how POSIX file permissions work really be the last line of defense against a next-generation CPU or GPU's specs being leaked to the competition?</p><p>I had to quickly learn about modern security practices when I started doing HPC in the commercial cloud out of necessity. I'm still nowhere close to being a security expert, but two years has been long enough for me to now cringe when I talk to my colleagues in the traditional HPC community about how they protect against threats. It's not really their fault that most of the HPC community hasn't adopted modern practices, because the tools and practices required to do it right aren't easy to set up, automate, and maintain from scratch.</p><p>For example, basic LDAP is a short path to allowing users to log into a cluster's nodes, but if those users also need to authenticate themselves to REST services that support an HPC workflow across multiple clusters, you have to start building a Rube Goldberg machine of software on top of LDAP. Similarly, sticking every user on their own overlay network is great to limit the blast radius of a compromised account. However, automating the configuration of VXLAN tunnel endpoints as nodes get allocated and deallocated to jobs requires a lot of fancy orchestration that is either very complicated to build and maintain yourself or very expensive to buy and maintain. As a result, HPC just accepts the risk. Cloud has figured all this out though, and the price of providing this security infrastructure is included in the cost of cloud-based supercomputers.</p><h3 id=\"pay-good\">But the pay is good, right?</h3><p>Like I said before I left the public sector, <a href=\"http://blog.glennklockwood.com/2022/05/life-and-leaving-nersc.html\">my base salary is comparable to what I got at the lab</a>. 
It's actually gotten less competitive because <a href=\"https://www.theregister.com/2023/05/11/microsoft_pay_freeze/\">all salaries were frozen</a> when I was first eligible for a raise. So, after considering the effects of inflation, my paycheck is a little lower than what it was in the government two years ago.</p><p>What's different is the bonus structure, which simply does not exist in the government or university world. For those who aren't familiar with how bonuses work in the tech industry, I'll share how it works for me:</p><p></p><ul><li>In the first year, I was awarded two signing bonuses: one in cash, one in stock. Half of the cash bonus was paid out up-front, and the other half was paid out after I had been there a year. The stock grant could not be touched during the first year because it had a one-year \"cliff.\"</li><li>On my one-year anniversary, I got the second half of my cash signing bonus, and my signing stock grant began \"vesting.\"</li><li>After a year, I was also eligible for an annual performance-based raise, cash bonus, and stock bonus.</li><ul><li>Because of the economy, my annual raise was zero.</li><li>The cash bonus was paid out in a lump sum, similar to my cash signing bonus.</li><li>The stock bonus was awarded all at once but follows a multi-year \"vesting schedule\" which means I am only actually given fractions of the total award over time. However, these bonuses don't have a \"cliff\" and begin vesting immediately.</li></ul><li>Every year thereafter, I am eligible for an annual raise, cash bonus, and another stock bonus.</li></ul><p>The way stock bonuses work was the least intuitive part to me, but since it's such a significant part of total compensation, it's worth spelling out for anyone who's considering an offer that includes this:</p><p></p><ul><li>Stock bonuses are defined in terms of dollar values. 
For example, let's say I got a signing stock bonus of $1,000 with a one-year cliff that vests quarterly (every three months) over five years.</li><li>On the day that stock bonus is awarded, my employer converts that $1,000 value into company stock based on the market value that day. If stocks are $50 per share, I am awarded 20 shares. My employer hangs on to those shares on my behalf, so I can't actually do anything with them yet.</li><li>Since I have a five-year vesting schedule and the award vests quarterly, my shares will vest twenty times (four quarters, five years). Coincidentally, since I have 20 shares, I will get one share per quarter.</li><li>However, because I have a one-year cliff, I get all four quarters of my first year at my one-year anniversary. So, four shares should appear in my brokerage account on my one-year anniversary. Once a share is in my brokerage account, I can do whatever I want with it (like sell it immediately!)</li><li>Every quarter thereafter, one more share vests and appears in my brokerage account.</li></ul><p>Assuming I get a stock bonus as part of my overall annual bonus, this means that stock awards pile up and vest every year. This is tricky for two reasons:</p><p></p><ol><li>Although my initial stock award was $1,000 in the above example, that amount was converted to stock the day it was awarded. <i>Assuming I am doing a good job and increasing the value of my employer's stock</i>, the value of those shares will increase while they're vesting. This means by the time the first four shares of my award vested at my one-year anniversary, they were worth more than the $50 per share they represented when they were awarded. 
More broadly, the value of a stock bonus tends to increase over time, making the true cash value of a $1,000 stock bonus a lot more than $1,000 by the time it completely vests.</li><li>Every year's stock award comes with its own multi-year vesting period, which means at any given time, I have multiple years' bonuses all vesting at once. This also means that at any given time, I have a bunch of unvested stock that's worth a lot of money that I can't yet spend. If I quit my job though, all these unvested shares vanish into thin air.</li></ol><p>These two factors make up the golden handcuffs that people often talk about in industry. The longer I stick around, the more unvested stock I have hanging over my head, and it usually becomes increasingly valuable (yet inaccessible!) over time. The reality is that if you've put in a few years in Big Tech, you might have years' worth of base salary tied up in unvested stock that all goes away if you quit.</p><p>The end result is that although base salary is competitive with what you can make in a government HPC facility, there's a significant cash bonus that falls out of the sky once a year, and an appreciable amount of stock appears in your brokerage account every couple of months which you can turn around and sell for more cash. Depending on seniority and performance, these bonuses can add up to a significant fraction of base salary.</p><p>Finally, the above is consistent with what I've seen firsthand at two companies in Big Tech but may be different based on the role and the company. For example, field-facing roles in sales and support may be completely different beasts, and private companies and startups load things differently due to the value of equity.</p><h3 id=\"work-life-balance\">How's work-life balance?</h3><p>It hasn't been different than working in the government. Just like at a lab or university, some people work around the clock while others stick pretty close to the standard workday. 
There may be a higher concentration of Type A personalities who put in a lot of time in Big Tech, and this may pressure others to keep up and also put in long hours, but there's rarely been an occasion where a manager expects staff to routinely work nights and weekends. Doing so would probably result in negative employee satisfaction scores which would roll up and eventually have to be addressed.</p><p>Of course, there are cases where working odd hours is required to get the job done. Because I work for a global organization, I've had to get up early to meet with teams or customers in Europe. I've also had to stay up late to meet with teams or customers in Asia. And on some particularly annoying days, I've had to do both and wind up working from 5am to 8pm. But I never felt that I had no choice in the matter; I pulled these hours because it was the right thing to do at the time. And I don't see this as being too different from the days when I'd work sixteen-hour days, seven days a week, for the entire month of March to put together a paper for SC. Or days when I'm at SC and am preparing talks, meeting with partners, and otherwise hustling from 8am to 1am for five days straight.</p><p>One big difference is the fact that my employer offers discretionary time off (\"unlimited vacation\"). This is a divisive topic in industry, but I see it as a positive for work-life balance because it underscores an emphasis on <i>outcomes</i> rather than <i>output</i>. I can take an afternoon off or enjoy a long weekend with little fanfare, because <i>productivity</i> is infinitely more valuable than <i>presence</i>. As long as I do what needs to get done, I don't have to worry about timing vacations to ensure I am banking enough time off in between.</p><h2 id=\"what-i-miss\">Do you miss anything about working at the lab?</h2><div>Absolutely. 
There are a bunch of appealing things about working in a DOE lab (or an NSF center) that I've had to give up since coming to industry.</div><h3 id=\"freedom-to-have-an-off-day\">Freedom to have an off day</h3><p>Right before I finished graduate school, I had a conversation with <a href=\"https://engineering.lehigh.edu/faculty/edmund-webb-iii\">Professor Edmund Webb</a> soon after he became a professor after a decade-long career at Sandia National Labs about life at the Labs. He said that, after becoming a professor, he lost the ability to just close the door to his office and focus on something he needed to get done for a day. I didn't really grasp what this meant at the time, but I totally get it now. The DOE might be one of the few places where you can take a day--maybe even a week--and just close your door to everything else that's going on around you to focus on what you want to do. In the case of professorship, there are always students requiring attention; in industry, it's customers and partners.</p><p>I think this difference results from two factors: very few things in public HPC are very urgent, and the Labs are stocked full of independent, free-thinking Ph.D. types. There's rarely a penalty if something is late by a day (or two years! Remember when <a href=\"https://insidehpc.com/2020/10/doe-under-secretary-for-science-dabbars-exascale-update-frontier-to-be-first-aurora-to-be-monitored/\">Aurora was called \"A21?\"</a>), but there can be huge payoff in prestige if one of your wacky side projects turns out to be something useful (this is how <a href=\"https://docs.nersc.gov/development/containers/shifter/\">Shifter</a> came to be). By comparison, working at a giant corporation often means there are a bunch of interdependencies on others, and the odds of any one of your 200,000 coworkers sending you a Teams message asking for help are just a lot higher than they are at a 70-person supercomputer center. 
The culture is much more team-oriented, and being a one-person army isn't incentivized as much.</p><h3 id=\"travel\">Travel</h3><p>Part of my job within the DOE complex was to go around the country (and the world) and be smart, and secondarily, show that my lab hired smart people and did smart things. If headquarters wanted to make sure that the supercomputer they were about to spend $500M on was technically sound, I'd sometimes get invited to go sit in on a review and try to poke holes in the design. If a European HPC project wanted to ensure they were including a global perspective on some dimension of future HPC strategy, I'd sometimes get invited to give a talk about how I view the world of data. And if these reviews and workshops happened to be in awesome places around the world--oh well!</p><p>I feel a lot more self-conscious about requesting approval to attend these sorts of boondoggles as an engineer now because the first question I have to answer is, \"Is this trip business critical?\" If there's a direct line of sight between me giving a talk at a workshop and a specific business strategy, I can say \"yes\" with a straight face. But it's hard to accept an invitation to fly off to Switzerland to give a 30-minute talk when I know that my attendance isn't going to move any needles.</p><h3 id=\"openness\">Openness</h3><p>Just like it's no longer my job to travel the world and just be smart, it's not my job to write about the work that I (or my team) do. I miss writing papers and giving technical talks, because the process of putting together coherent thoughts around a technical topic is one of the ways I really come to understand it. 
There are also a lot of really wild ideas that we're pursuing at scale that the scientific computing community has never considered, but there are two factors that work against being open about these things:</p><p></p><ol><li>In terms of prioritization, my time is always better spent solving problems, or at least documenting them for internal audiences who fully grasp the context around them, than writing about them in a way that the rest of the world can understand. It's hard to justify the time to write a retrospective or a study unless there's a strategic advantage behind it.</li><li>The customers I support typically do not want the world knowing what they're doing. There is an AI arms race happening right now, and having the technical sophistication to utilize massive-scale supercomputers effectively is a competitive advantage. In the traditional HPC community, only national security is comparable in the level of secrecy involved, and none of the intelligence agencies are openly contributing to the state of the art in HPC either.</li></ol><div>So instead of making conference papers and presentations, these days I make more internal papers and presentations. I'm trying to figure out ways to publish interesting technical anecdotes on my website (for example, I maintain <a href=\"https://www.glennklockwood.com/ai/ai-requirements.html\">a collection of LLM training requirements as I am exposed to them</a>), but it's a lot of extra work to disentangle the proprietary bits from my work notes to do this.</div><p></p><p>Related to openness is also freedom to speak my mind in public forums. I had the most latitude to blast my opinions out onto the Internet when I was still early in my career and nobody listened to me, but I've had to get progressively less opinionated over the years. 
At this point, I abide by a written corporate social media policy which, although very reasonable in what it requests (don't slander competitors, always be transparent about who employs you), stops me from commenting on news as much as I used to since so many tech companies qualify as competitors in some dimension.</p><h2 id=\"regret-decision\">Would you still have left knowing what you know now?</h2><p>Yes. I still stand by just about everything I wrote in my <a href=\"http://blog.glennklockwood.com/2022/05/life-and-leaving-nersc.html\">original blog post</a>; at the time, I just needed a change, and I found the change that I was looking for. Without immersing myself in the world of cloud, I would have never learned about virtualization, physical infrastructure, or modern security to the degree that I have. And the fact that I stumbled into what has become one of the leading companies in AI at the dawn of generative AI was an extremely lucky coincidence.</p><p>However, this doesn't mean that I now turn my nose up at doing HPC in the public sector. There are many unique aspects to working at a DOE lab or NSF center that have no parallel in industry. I also believe that I am the sum of the experiences that led me to where I work today, and I would never have gotten the opportunity to write this retrospective if I hadn't learned everything I did working in the DOE and NSF.</p><p>And perhaps above all else, there is something attractive about public service that I haven't been able to shake in the last two years. I still dial in to <a href=\"https://science.osti.gov/ascr/ascac\">ASCAC meetings</a> to see what the world of public HPC and scientific computing is thinking and doing, and I still try to contribute time and attention to working groups like <a href=\"https://www.nitrd.gov/coordination-areas/lsn/magic/\">NITRD's MAGIC</a>. 
I write lengthy blog posts in a <a href=\"http://blog.glennklockwood.com/2024/05/isc24-recap.html\">futile attempt to caution the leaders in public-sector HPC</a> against rejecting AI workloads in commercial clouds as HPC. And every time I learn some slick way we deal with hard technological or sociological issues at work, I still file it away in the \"good ideas for when I go back\" folder in the back of my mind.</p><p>I don't have any near-term plans to go anywhere though. Like I said before, there are still plenty of days when dialing into work is like going to the playground. Amazing things are happening in the world of HPC infrastructure at scale now that the world is pouring money into AI, and the rate of scale and innovation is no longer constrained to <a href=\"https://www.llnl.gov/article/48101/powering-llnl-prepares-exascale-massive-energy-water-upgrade\">40 MW</a> and <a href=\"https://www.olcf.ornl.gov/wp-content/uploads/OLCF-6-RFP-Cover-Letter-07-19-2024.pdf\">$500M</a> per supercomputer like it was when public-sector HPC was setting the bar for leadership. There is a whole new exciting world of challenges and possibilities when you start thinking about building supercomputers that consume <a href=\"https://www.datacenterdynamics.com/en/news/aws-acquires-talens-nuclear-data-center-campus-in-pennsylvania/\">hundreds of megawatts of power</a>.</p><p>Like I wrote two years ago, I don't think any government has the appetite to build data centers for scientific computing that are larger than today's 50 MW exascale facilities. This means that government HPC centers will never have a reason to explore the exciting world of 100+ MW supercomputers or work on the wacky problems that arise at that scale. Consequently, the biggest and most challenging problems in HPC--at least in terms of infrastructure and systems design at scale--are becoming unique to industry, not public HPC.</p><p>I got into HPC because I enjoy working on large, complex systems. 
Considering where I am at this stage of my life, what I want to accomplish in the rest of my career, and what gets me out of bed in the morning, I feel like I wound up in the right place for now. I have no regrets.</p>",
            "url": "https://hpc.social/personal-blog/2024/how-has-life-after-leaving-the-labs-been-going/",
            
            
            
            
            
            "date_published": "2024-08-04T20:21:00-06:00",
            "date_modified": "2024-08-04T20:21:00-06:00",
            
                "author": "Glenn K. Lockwood's Blog"
            
        },
    
        {
            "id": "https://hpc.social/personal-blog/2024/advanced-lsf-resource-connector-configuration-on-ibm-cloud-part-iii/",
            "title": "Advanced LSF resource connector configuration on IBM Cloud - part III",
            "summary": null,
            "content_text": "OverviewThis is the third instalment in a series of blogs covering advanced configuration topics for LSF resource connector. The earlier parts in the series can be found here: part I, part II.As hinted in the closing of part II, this instalment will cover running Docker workloads on cloud instances which are dynamically managed by the LSF resource connector. The cloud environment in this example is IBM Cloud. To understand more about LSF resource connector, please read the earlier parts in the blog series.LSF provides a framework for the management and execution of containerized workloads. It supports the following container runtimes: Docker, NVIDIA Docker, Shifter, Singularity, Podman and Enroot. The LSF documentation provides configuration steps for the supported container runtimes. Once configured, this capability is effectively transparent from the end user perspective.Enable Docker supportFirst we need to enable support in LSF to run Docker containers. This is covered in detail in the LSF documentation and also something which I wrote about previously in the blog post Jack of all containers. The following steps will assume that the configuration steps have been completed.LSF uses a Boolean resource named docker to identify hosts where the Docker runtime is available. This Boolean resource needs to be set on the compute nodes which are dynamically started by LSF resource connector.In our example, an insecure Docker repository (using http) has been set up on the LSF manager host in the cluster with hostname lsf-mgmt-host. This will serve as the repository to host an OpenFOAM Docker container which has been prepared according to the procedures documented here. This blog will not go into detail on the creation of the insecure Docker registry. On the LSF management node, below is the output showing the available images. 
We see the OpenFOAM image is available both locally and via http on port 5000.# docker image lsREPOSITORY                   TAG           IMAGE ID      CREATED        SIZElocalhost/openfoam/openfoam  v1912_update  bce4eb059f36  11 days ago    6.71 GBlocalhost:5000/openfoam      v1912_update  bce4eb059f36  11 days ago    6.71 GBdocker.io/library/registry   2             6a3edb1d5eb6  10 months ago  26 MBNote An insecure Docker registry was used in this example for simplicity and is not recommended in production.As was the case in part II of the blog series, the user_data.sh script will be used for multiple purposes here:Set docker Boolean variable on dynamic compute nodesInstall Docker CE runtime and relevant support packagesAdd user(s) to the docker group (/etc/group)Configuration to point to insecure Docker registry on LSF management host lsf-mgmt-hostThe following updates were made to the user_data.sh script. See comments inline for details.$ diff -u4 ./user_data.sh ./user_data_sh.org--- ./user_data.sh\t2024-07-29 18:44:24.483146000 +0000+++ ./user_data_sh.org\t2024-07-11 14:34:47.688341000 +0000@@ -29,25 +29,8 @@  #!/bin/bash # shellcheck disable=all -# -# The following steps will add the Docker CE repo, install the latest Docker CE-# version along with supporting packages. It will create a Docker Linux group-# and add the lsfadmin user to that group. Furthermore, it will create-# the /etc/docker/daemon.json file pointing to the insecure Docker registry-# which has been configured on the LSF management host. Finally it will-# start Docker. Note that the hostname lsf-mgmt-host for the insecure-registries-# configuration of Docker needs to be updated accordingly. 
-# -yum-config-manager --add-repo https://download.docker.com/linux/rhel/docker-ce.repo -y -dnf install htop hwloc hwloc-libs libevent stress stress-ng python36 docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin -y  &gt;&gt; $logfile 2&gt;&amp;1-ln -s /usr/bin/python3 /usr/bin/python-groupadd docker &gt;&gt; $logfile 2&gt;&amp;1-usermod -aG docker lsfadmin  &gt;&gt; $logfile 2&gt;&amp;1 -echo -e \"{\\n \\\"insecure-registries\\\" : [ \\\"lsf-mgmt-host:5000\\\" ]\\n }\" &gt;&gt; /etc/docker/daemon.json -systemctl start docker &gt;&gt; $logfile 2&gt;&amp;1  - if [ \"$compute_user_data_vars_ok\" != \"1\" ]; then   echo 2&gt;&amp;1 \"fatal: vars block is missing\"   exit 1 fi@@ -225,15 +208,8 @@ else   echo \"Can not get instance ID\" &gt;&gt; $logfile fi -# -# Add the docker Boolean variable to the LSF_LOCAL_RESOURCES variable in-# the lsf.conf file on the compute hosts. This will ensure that the host-# is tagged with the docker variable. -# -sed -i \"s/\\(LSF_LOCAL_RESOURCES=.*\\)\\\"/\\1 [resource docker]\\\"/\" $LSF_CONF_FILE &gt;&gt; $logfile 2&gt;&amp;1 - #Update LSF Tuning on dynamic hosts LSF_TUNABLES=\"etc/sysctl.conf\" echo 'vm.overcommit_memory=1' &gt;&gt; $LSF_TUNABLES echo 'net.core.rmem_max=26214400' &gt;&gt; $LSF_TUNABLESApplication profile configurationNext, we configure the LSF application profile for the OpenFOAM Docker container which has been loaded into the insecure Docker registry on the LSF management host. LSF application profiles can be used to define common job parameters for the same job type. This includes the container and container runtime definition. Learn more about LSF application profiles here.On the LSF management node, the following application profile is defined in $LSF_ENVDIR/lsbatch/&lt;clustername&gt;/configdir/lsb.applications. Note that the hostname lsf-mgmt-host must point to the hostname where the insecure Docker repository has been set up in your environment. 
Additionally the volume specification -v /mnt/vpcstorage/data is specific to this environment and can be adjusted or removed as needed.….….Begin ApplicationNAME = openfoamDESCRIPTION = Example OpenFOAM applicationCONTAINER = docker[image(lsf-mgmt-host:5000/openfoam:v1912_update) \\   options(--rm --net=host --ipc=host \\   --cap-add=SYS_PTRACE \\   -v /etc/passwd:/etc/passwd \\   -v /etc/group:/etc/group \\   -v /mnt/vpcstorage/data:/mnt/vpcstorage/data \\    ) starter(root)]   EXEC_DRIVER = context[user(lsfadmin)] \\   starter[/opt/ibm/lsf_worker/10.1/linux3.10-glibc2.17-x86_64/etc/docker-starter.py] \\   controller[/opt/ibm/lsf_worker/10.1/linux3.10-glibc2.17-x86_64/etc/docker-control.py] \\   monitor[/opt/ibm/lsf_worker/10.1/linux3.10-glibc2.17-x86_64/etc/docker-monitor.py]End Application….….In order to make the above change take effect, run the badmin reconfig command as the defined LSF administrator. The LSF bapp command can be used to check the newly defined configuration for LSF application profile openfoam.$ badmin reconfigChecking configuration files ...No errors found.Reconfiguration initiated$ bapp -l openfoamAPPLICATION NAME: openfoam -- Example OpenFOAM applicationSTATISTICS:   NJOBS     PEND      RUN    SSUSP    USUSP      RSV        0        0        0        0        0        0PARAMETERS:CONTAINER: docker[image(lsf-mgmt-host:5000/openfoam:v1912_update)    options(--rm --net=host --ipc=host    --cap-add=SYS_PTRACE    -v /etc/passwd:/etc/passwd    -v /etc/group:/etc/group    -v /mnt/vpcstorage/data:/mnt/vpcstorage/data    ) starter(root)]EXEC_DRIVER:     context[user(lsfadmin)]    starter[/opt/ibm/lsf_worker/10.1/linux3.10-glibc2.17-x86_64/etc/docker-starter.py]    controller[/opt/ibm/lsf_worker/10.1/linux3.10-glibc2.17-x86_64/etc/docker-control.py]    monitor[/opt/ibm/lsf_worker/10.1/linux3.10-glibc2.17-x86_64/etc/docker-monitor.py]Submitting workloadWith all of the configuration in place, it’s now time to submit an OpenFOAM workload. 
For this, LSF Application Center is used. The OpenFOAM application template is available on the Spectrum Computing github here. The OpenFOAM application template is configured to use the openfoam application profile. An example job is submitted and it runs to completion successfully. In the screenshot below, we see that the openfoam Docker container is executed.The LSF bjobs and bhist output from the job follows below:$ bjobs -l 2613Job &lt;2613&gt;, Job Name &lt;myOpenFoam_run_motorBike&gt;, User &lt;lsfadmin&gt;, Project &lt;defa                     ult&gt;, Application &lt;openfoam&gt;, Status &lt;RUN&gt;, Queue &lt;normal&gt;                     , Command &lt;/mnt/lsf/repository-path/lsfadmin/myOpenFoam_ru                     n_1722358195200AuWDY/motorBike/bsub.myOpenFoam_run&gt;, Share                      group charged &lt;/lsfadmin&gt;Tue Jul 30 16:49:55: Submitted from host &lt;gsamu-hpc-demo-mgmt-1-a844-001&gt;, CWD                      &lt;/mnt/lsf/repository-path/lsfadmin/myOpenFoam_run_17223581                     95200AuWDY/motorBike&gt;, Specified CWD &lt;/mnt/lsf/repository-                     path/lsfadmin/myOpenFoam_run_1722358195200AuWDY/motorBike&gt;                     , Output File &lt;/mnt/lsf/repository-path/lsfadmin/myOpenFoa                     m_run_1722358195200AuWDY/motorBike/output.lsfadmin.txt&gt;, E                     rror File &lt;/mnt/lsf/repository-path/lsfadmin/myOpenFoam_ru                     n_1722358195200AuWDY/motorBike/error.lsfadmin.txt&gt;, Notify                      when job begins/ends, 6 Task(s), Requested Resources &lt;spa                     n[hosts=1]&gt;;Tue Jul 30 16:55:23: Started 6 Task(s) on Host(s) &lt;gsamu-hpc-demo-10-241-0-137&gt;                     &lt;gsamu-hpc-demo-10-241-0-137&gt; &lt;gsamu-hpc-demo-10-241-0-137                     &gt; &lt;gsamu-hpc-demo-10-241-0-137&gt; &lt;gsamu-hpc-demo-10-241-0-1                     37&gt; &lt;gsamu-hpc-demo-10-241-0-137&gt;, Allocated 6 Slot(s) on               
       Host(s) &lt;gsamu-hpc-demo-10-241-0-137&gt; &lt;gsamu-hpc-demo-10-2                     41-0-137&gt; &lt;gsamu-hpc-demo-10-241-0-137&gt; &lt;gsamu-hpc-demo-10                     -241-0-137&gt; &lt;gsamu-hpc-demo-10-241-0-137&gt; &lt;gsamu-hpc-demo-                     10-241-0-137&gt;, Execution Home &lt;/home/lsfadmin&gt;, Execution                      CWD &lt;/mnt/lsf/repository-path/lsfadmin/myOpenFoam_run_1722                     358195200AuWDY/motorBike&gt;;Tue Jul 30 17:03:33: Resource usage collected.                     The CPU time used is 1411 seconds.                     MEM: 928 Mbytes;  SWAP: 0 Mbytes;  NTHREAD: 41                     PGID: 18426;  PIDs: 18426 18427 18428 20088                      PGID: 20374;  PIDs: 20374 20388 20800 21385                      PGID: 21389;  PIDs: 21389                      PGID: 21390;  PIDs: 21390                      PGID: 21391;  PIDs: 21391                      PGID: 21392;  PIDs: 21392                      PGID: 21393;  PIDs: 21393                      PGID: 21394;  PIDs: 21394  MEMORY USAGE: MAX MEM: 982 Mbytes;  AVG MEM: 422 Mbytes; MEM Efficiency: 0.00% CPU USAGE: CPU PEAK: 5.89 ;  CPU PEAK DURATION: 63 second(s) CPU AVERAGE EFFICIENCY: 42.81% ;  CPU PEAK EFFICIENCY: 98.15% SCHEDULING PARAMETERS:           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem loadSched   -     -     -     -       -     -    -     -     -      -      -   loadStop    -     -     -     -       -     -    -     -     -      -      -   RESOURCE REQUIREMENT DETAILS: Combined: select[(docker) &amp;&amp; (type == any)] order[r15s:pg] span[hosts=1] Effective: select[(docker) &amp;&amp; (type == any)] order[r15s:pg] span[hosts=1] $ bhist -l 2613Job &lt;2613&gt;, Job Name &lt;myOpenFoam_run_motorBike&gt;, User &lt;lsfadmin&gt;, Project &lt;defa                     ult&gt;, Application &lt;openfoam&gt;, Command &lt;/mnt/lsf/repository                     -path/lsfadmin/myOpenFoam_run_1722358195200AuWDY/motorBike    
                 /bsub.myOpenFoam_run&gt;Tue Jul 30 16:49:55: Submitted from host &lt;gsamu-hpc-demo-mgmt-1-a844-001&gt;, to Q                     ueue &lt;normal&gt;, CWD &lt;/mnt/lsf/repository-path/lsfadmin/myOp                     enFoam_run_1722358195200AuWDY/motorBike&gt;, Specified CWD &lt;/                     mnt/lsf/repository-path/lsfadmin/myOpenFoam_run_1722358195                     200AuWDY/motorBike&gt;, Output File &lt;/mnt/lsf/repository-path                     /lsfadmin/myOpenFoam_run_1722358195200AuWDY/motorBike/outp                     ut.lsfadmin.txt&gt;, Error File &lt;/mnt/lsf/repository-path/lsf                     admin/myOpenFoam_run_1722358195200AuWDY/motorBike/error.ls                     fadmin.txt&gt;, Notify when job begins/ends, 6 Task(s), Reque                     sted Resources &lt;span[hosts=1]&gt;;Tue Jul 30 16:55:23: Dispatched 6 Task(s) on Host(s) &lt;gsamu-hpc-demo-10-241-0-1                     37&gt; &lt;gsamu-hpc-demo-10-241-0-137&gt; &lt;gsamu-hpc-demo-10-241-0                     -137&gt; &lt;gsamu-hpc-demo-10-241-0-137&gt; &lt;gsamu-hpc-demo-10-241                     -0-137&gt; &lt;gsamu-hpc-demo-10-241-0-137&gt;, Allocated 6 Slot(s)                      on Host(s) &lt;gsamu-hpc-demo-10-241-0-137&gt; &lt;gsamu-hpc-demo-                     10-241-0-137&gt; &lt;gsamu-hpc-demo-10-241-0-137&gt; &lt;gsamu-hpc-dem                     o-10-241-0-137&gt; &lt;gsamu-hpc-demo-10-241-0-137&gt; &lt;gsamu-hpc-d                     emo-10-241-0-137&gt;, Effective RES_REQ &lt;select[(docker) &amp;&amp; (                     type == any)] order[r15s:pg] span[hosts=1] &gt;;Tue Jul 30 16:55:23: Starting (Pid 18426);Tue Jul 30 16:55:24: Running with execution home &lt;/home/lsfadmin&gt;, Execution CW                     D &lt;/mnt/lsf/repository-path/lsfadmin/myOpenFoam_run_172235                     8195200AuWDY/motorBike&gt;, Execution Pid &lt;18426&gt;;Tue Jul 30 17:04:01: Done successfully. 
The CPU time used is 1535.1 seconds;Tue Jul 30 17:04:02: Post job process done successfully;MEMORY USAGE:MAX MEM: 982 Mbytes;  AVG MEM: 431 Mbytes; MEM Efficiency: 0.00%CPU USAGE:CPU PEAK: 5.92 ;  CPU PEAK DURATION: 63 second(s)CPU AVERAGE EFFICIENCY: 50.67% ;  CPU PEAK EFFICIENCY: 98.68%Summary of time in seconds spent in various states by  Tue Jul 30 17:04:02  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL  328      0        518      0        0        0        846         ConclusionThe user_data.sh script of LSF resource connector allows a high degree of customization for cloud compute resources that dynamically join the LSF cluster. We’ve demonstrated how it can be used to tag cloud compute resources with a specific LSF Boolean resource in addition to the ability to install specific packages and do configuration customization. This is a simplified example, but illustrates this point.",
            "content_html": "<p><strong>Overview</strong></p><p>This is the third instalment in a series of blogs covering advanced configuration topics for LSF resource connector. The earlier parts in the series can be found here: <a href=\"https://community.ibm.com/community/user/cloud/blogs/gbor-samu/2023/11/09/advanced-resource-connector-configuration-on-ibm-c\">part I</a>, <a href=\"https://community.ibm.com/community/user/cloud/blogs/gbor-samu/2024/03/20/advanced-lsf-resource-connector-configuration-on-i\">part II</a>.</p><p>As hinted in the closing of part II, this instalment will cover running Docker workloads on cloud instances which are dynamically managed by the LSF resource connector. The cloud environment in this example is <a href=\"https://cloud.ibm.com/catalog/content/terraform-1623200063-71606cab-c6e1-4f95-a47a-2ce541dcbed8-global\">IBM Cloud</a>. To understand more about LSF resource connector, please read the earlier parts in the blog series.</p><p><a href=\"https://www.ibm.com/products/hpc-workload-management\">LSF</a> provides a framework for the management and execution of containerized workloads. It supports the following container runtimes: Docker, NVIDIA Docker, Shifter, Singularity, Podman and Enroot. The LSF <a href=\"https://www.ibm.com/docs/en/spectrum-lsf/10.1.0?topic=lsf-configuring-containers\">documentation</a> provides configuration steps for the supported container runtimes. Once configured, this capability is effectively transparent from the end user perspective.</p><p><strong>Enable Docker support</strong></p><p>First we need to enable support in LSF to run Docker containers. This is covered in detail in the LSF <a href=\"https://www.ibm.com/docs/en/spectrum-lsf/10.1.0?topic=containers-lsf-docker\">documentation</a> and also something which I wrote about previously in the blog post <a href=\"https://medium.com/ibm-data-ai/jack-of-all-containers-e0d7fd0633b3\">Jack of all containers</a>. 
The following steps will assume that the configuration steps have been completed.</p><p>LSF uses a Boolean resource named <em>docker</em> to identify hosts where the Docker runtime is available. This Boolean resource needs to be set on the compute nodes which are dynamically started by LSF resource connector.</p><p>In our example, an insecure Docker repository (using http) has been set up on the LSF manager host in the cluster with hostname <em>lsf-mgmt-host</em>. This will serve as the repository to host an OpenFOAM Docker container which has been prepared according to the procedures documented <a href=\"https://community.ibm.com/community/user/cloud/blogs/john-welch/2020/02/12/building-an-openfoam-ready-container-for-lsf\">here</a>. This blog will not go into detail on the creation of the insecure Docker registry. On the LSF management node, below is the output showing the available images. We see the OpenFOAM image is available both locally and via http on port 5000.</p><div class=\"highlight\"><pre><code class=\"language-plaintext\"># docker image lsREPOSITORY                   TAG           IMAGE ID      CREATED        SIZElocalhost/openfoam/openfoam  v1912_update  bce4eb059f36  11 days ago    6.71 GBlocalhost:5000/openfoam      v1912_update  bce4eb059f36  11 days ago    6.71 GBdocker.io/library/registry   2             6a3edb1d5eb6  10 months ago  26 MB</code></pre></div><p><strong>Note</strong> An insecure Docker registry was used in this example for simplicity and is not recommended in production.</p><p>As was the case in part II of the blog series, the <em>user_data.sh</em> script will be used for multiple purposes here:</p><ul><li>Set <em>docker</em> Boolean variable on dynamic compute nodes</li><li>Install Docker CE runtime and relevant support packages</li><li>Add user(s) to the docker group (<em>/etc/group</em>)</li><li>Configuration to point to insecure Docker registry on LSF management host <em>lsf-mgmt-host</em></li></ul><p>The following updates were 
made to the <em>user_data.sh</em> script. See comments inline for details.</p><div class=\"highlight\"><pre><code class=\"language-plaintext\">$ diff -u4 ./user_data.sh ./user_data_sh.org--- ./user_data.sh\t2024-07-29 18:44:24.483146000 +0000+++ ./user_data_sh.org\t2024-07-11 14:34:47.688341000 +0000@@ -29,25 +29,8 @@  #!/bin/bash # shellcheck disable=all -# -# The following steps will add the Docker CE repo, install the latest Docker CE-# version along with supporting packages. It will create a Docker Linux group-# and add the lsfadmin user to that group. Furthermore, it will create-# the /etc/docker/daemon.json file pointing to the insecure Docker registry-# which has been configured on the LSF management host. Finally it will-# start Docker. Note that the hostname lsf-mgmt-host for the insecure-registries-# configuration of Docker needs to be updated accordingly. -# -yum-config-manager --add-repo https://download.docker.com/linux/rhel/docker-ce.repo -y -dnf install htop hwloc hwloc-libs libevent stress stress-ng python36 docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin -y  &gt;&gt; $logfile 2&gt;&amp;1-ln -s /usr/bin/python3 /usr/bin/python-groupadd docker &gt;&gt; $logfile 2&gt;&amp;1-usermod -aG docker lsfadmin  &gt;&gt; $logfile 2&gt;&amp;1 -echo -e \"{\\n \\\"insecure-registries\\\" : [ \\\"lsf-mgmt-host:5000\\\" ]\\n }\" &gt;&gt; /etc/docker/daemon.json -systemctl start docker &gt;&gt; $logfile 2&gt;&amp;1  - if [ \"$compute_user_data_vars_ok\" != \"1\" ]; then   echo 2&gt;&amp;1 \"fatal: vars block is missing\"   exit 1 fi@@ -225,15 +208,8 @@ else   echo \"Can not get instance ID\" &gt;&gt; $logfile fi -# -# Add the docker Boolean variable to the LSF_LOCAL_RESOURCES variable in-# the lsf.conf file on the compute hosts. This will ensure that the host-# is tagged with the docker variable. 
-# -sed -i \"s/\\(LSF_LOCAL_RESOURCES=.*\\)\\\"/\\1 [resource docker]\\\"/\" $LSF_CONF_FILE &gt;&gt; $logfile 2&gt;&amp;1 - #Update LSF Tuning on dynamic hosts LSF_TUNABLES=\"etc/sysctl.conf\" echo 'vm.overcommit_memory=1' &gt;&gt; $LSF_TUNABLES echo 'net.core.rmem_max=26214400' &gt;&gt; $LSF_TUNABLES</code></pre></div><p><strong>Application profile configuration</strong></p><p>Next, we configure the LSF application profile for the OpenFOAM Docker container which has been loaded into the insecure Docker registry on the LSF management host. LSF application profiles can be used to define common job parameters for the same job type. This includes the container and container runtime definition. Learn more about LSF application profiles <a href=\"https://www.ibm.com/docs/en/spectrum-lsf/10.1.0?topic=lsf-application-profiles\">here</a>.</p><p>On the LSF management node, the following application profile is defined in <em>$LSF_ENVDIR/lsbatch/&lt;clustername&gt;/configdir/lsb.applications</em>. Note that the hostname <em>lsf-mgmt-host</em> must point to the hostname where the insecure Docker repository has been set up in your environment. 
Additionally the volume specification <em>-v /mnt/vpcstorage/data</em> is specific to this environment and can be adjusted or removed as needed.</p><div class=\"highlight\"><pre><code class=\"language-plaintext\">….….Begin ApplicationNAME = openfoamDESCRIPTION = Example OpenFOAM applicationCONTAINER = docker[image(lsf-mgmt-host:5000/openfoam:v1912_update) \\   options(--rm --net=host --ipc=host \\   --cap-add=SYS_PTRACE \\   -v /etc/passwd:/etc/passwd \\   -v /etc/group:/etc/group \\   -v /mnt/vpcstorage/data:/mnt/vpcstorage/data \\    ) starter(root)]   EXEC_DRIVER = context[user(lsfadmin)] \\   starter[/opt/ibm/lsf_worker/10.1/linux3.10-glibc2.17-x86_64/etc/docker-starter.py] \\   controller[/opt/ibm/lsf_worker/10.1/linux3.10-glibc2.17-x86_64/etc/docker-control.py] \\   monitor[/opt/ibm/lsf_worker/10.1/linux3.10-glibc2.17-x86_64/etc/docker-monitor.py]End Application….….</code></pre></div><p>In order to make the above change take effect, run the <em>badmin reconfig</em> command as the defined LSF administrator. 
The LSF <em>bapp</em> command can be used to check the newly defined configuration for LSF application profile <em>openfoam</em>.</p><div class=\"highlight\"><pre><code class=\"language-plaintext\">$ badmin reconfigChecking configuration files ...No errors found.Reconfiguration initiated$ bapp -l openfoamAPPLICATION NAME: openfoam -- Example OpenFOAM applicationSTATISTICS:   NJOBS     PEND      RUN    SSUSP    USUSP      RSV        0        0        0        0        0        0PARAMETERS:CONTAINER: docker[image(lsf-mgmt-host:5000/openfoam:v1912_update)    options(--rm --net=host --ipc=host    --cap-add=SYS_PTRACE    -v /etc/passwd:/etc/passwd    -v /etc/group:/etc/group    -v /mnt/vpcstorage/data:/mnt/vpcstorage/data    ) starter(root)]EXEC_DRIVER:     context[user(lsfadmin)]    starter[/opt/ibm/lsf_worker/10.1/linux3.10-glibc2.17-x86_64/etc/docker-starter.py]    controller[/opt/ibm/lsf_worker/10.1/linux3.10-glibc2.17-x86_64/etc/docker-control.py]    monitor[/opt/ibm/lsf_worker/10.1/linux3.10-glibc2.17-x86_64/etc/docker-monitor.py]</code></pre></div><p><strong>Submitting workload</strong></p><p>With all of the configuration in place, it’s now time to submit an OpenFOAM workload. For this, LSF Application Center is used. The OpenFOAM application template is available on the Spectrum Computing github <a href=\"https://github.com/IBMSpectrumComputing/lsf-integrations\">here</a>. The OpenFOAM application template is configured to use the <em>openfoam</em> application profile. An example job is submitted and it runs to completion successfully. 
In the screenshot below, we see that the openfoam Docker container is executed.</p><figure><img src=\"https://www.gaborsamu.com/images/openfoam_job.jpg\" /></figure><p>The LSF <em>bjobs</em> and <em>bhist</em> output from the job follows below:</p><div class=\"highlight\"><pre><code class=\"language-plaintext\">$ bjobs -l 2613Job &lt;2613&gt;, Job Name &lt;myOpenFoam_run_motorBike&gt;, User &lt;lsfadmin&gt;, Project &lt;defa                     ult&gt;, Application &lt;openfoam&gt;, Status &lt;RUN&gt;, Queue &lt;normal&gt;                     , Command &lt;/mnt/lsf/repository-path/lsfadmin/myOpenFoam_ru                     n_1722358195200AuWDY/motorBike/bsub.myOpenFoam_run&gt;, Share                      group charged &lt;/lsfadmin&gt;Tue Jul 30 16:49:55: Submitted from host &lt;gsamu-hpc-demo-mgmt-1-a844-001&gt;, CWD                      &lt;/mnt/lsf/repository-path/lsfadmin/myOpenFoam_run_17223581                     95200AuWDY/motorBike&gt;, Specified CWD &lt;/mnt/lsf/repository-                     path/lsfadmin/myOpenFoam_run_1722358195200AuWDY/motorBike&gt;                     , Output File &lt;/mnt/lsf/repository-path/lsfadmin/myOpenFoa                     m_run_1722358195200AuWDY/motorBike/output.lsfadmin.txt&gt;, E                     rror File &lt;/mnt/lsf/repository-path/lsfadmin/myOpenFoam_ru                     n_1722358195200AuWDY/motorBike/error.lsfadmin.txt&gt;, Notify                      when job begins/ends, 6 Task(s), Requested Resources &lt;spa                     n[hosts=1]&gt;;Tue Jul 30 16:55:23: Started 6 Task(s) on Host(s) &lt;gsamu-hpc-demo-10-241-0-137&gt;                     &lt;gsamu-hpc-demo-10-241-0-137&gt; &lt;gsamu-hpc-demo-10-241-0-137                     &gt; &lt;gsamu-hpc-demo-10-241-0-137&gt; &lt;gsamu-hpc-demo-10-241-0-1                     37&gt; &lt;gsamu-hpc-demo-10-241-0-137&gt;, Allocated 6 Slot(s) on                      Host(s) &lt;gsamu-hpc-demo-10-241-0-137&gt; &lt;gsamu-hpc-demo-10-2                     41-0-137&gt; 
&lt;gsamu-hpc-demo-10-241-0-137&gt; &lt;gsamu-hpc-demo-10                     -241-0-137&gt; &lt;gsamu-hpc-demo-10-241-0-137&gt; &lt;gsamu-hpc-demo-                     10-241-0-137&gt;, Execution Home &lt;/home/lsfadmin&gt;, Execution                      CWD &lt;/mnt/lsf/repository-path/lsfadmin/myOpenFoam_run_1722                     358195200AuWDY/motorBike&gt;;Tue Jul 30 17:03:33: Resource usage collected.                     The CPU time used is 1411 seconds.                     MEM: 928 Mbytes;  SWAP: 0 Mbytes;  NTHREAD: 41                     PGID: 18426;  PIDs: 18426 18427 18428 20088                      PGID: 20374;  PIDs: 20374 20388 20800 21385                      PGID: 21389;  PIDs: 21389                      PGID: 21390;  PIDs: 21390                      PGID: 21391;  PIDs: 21391                      PGID: 21392;  PIDs: 21392                      PGID: 21393;  PIDs: 21393                      PGID: 21394;  PIDs: 21394  MEMORY USAGE: MAX MEM: 982 Mbytes;  AVG MEM: 422 Mbytes; MEM Efficiency: 0.00% CPU USAGE: CPU PEAK: 5.89 ;  CPU PEAK DURATION: 63 second(s) CPU AVERAGE EFFICIENCY: 42.81% ;  CPU PEAK EFFICIENCY: 98.15% SCHEDULING PARAMETERS:           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem loadSched   -     -     -     -       -     -    -     -     -      -      -   loadStop    -     -     -     -       -     -    -     -     -      -      -   RESOURCE REQUIREMENT DETAILS: Combined: select[(docker) &amp;&amp; (type == any)] order[r15s:pg] span[hosts=1] Effective: select[(docker) &amp;&amp; (type == any)] order[r15s:pg] span[hosts=1] </code></pre></div><div class=\"highlight\"><pre><code class=\"language-plaintext\">$ bhist -l 2613Job &lt;2613&gt;, Job Name &lt;myOpenFoam_run_motorBike&gt;, User &lt;lsfadmin&gt;, Project &lt;defa                     ult&gt;, Application &lt;openfoam&gt;, Command &lt;/mnt/lsf/repository                     -path/lsfadmin/myOpenFoam_run_1722358195200AuWDY/motorBike                     
/bsub.myOpenFoam_run&gt;Tue Jul 30 16:49:55: Submitted from host &lt;gsamu-hpc-demo-mgmt-1-a844-001&gt;, to Q                     ueue &lt;normal&gt;, CWD &lt;/mnt/lsf/repository-path/lsfadmin/myOp                     enFoam_run_1722358195200AuWDY/motorBike&gt;, Specified CWD &lt;/                     mnt/lsf/repository-path/lsfadmin/myOpenFoam_run_1722358195                     200AuWDY/motorBike&gt;, Output File &lt;/mnt/lsf/repository-path                     /lsfadmin/myOpenFoam_run_1722358195200AuWDY/motorBike/outp                     ut.lsfadmin.txt&gt;, Error File &lt;/mnt/lsf/repository-path/lsf                     admin/myOpenFoam_run_1722358195200AuWDY/motorBike/error.ls                     fadmin.txt&gt;, Notify when job begins/ends, 6 Task(s), Reque                     sted Resources &lt;span[hosts=1]&gt;;Tue Jul 30 16:55:23: Dispatched 6 Task(s) on Host(s) &lt;gsamu-hpc-demo-10-241-0-1                     37&gt; &lt;gsamu-hpc-demo-10-241-0-137&gt; &lt;gsamu-hpc-demo-10-241-0                     -137&gt; &lt;gsamu-hpc-demo-10-241-0-137&gt; &lt;gsamu-hpc-demo-10-241                     -0-137&gt; &lt;gsamu-hpc-demo-10-241-0-137&gt;, Allocated 6 Slot(s)                      on Host(s) &lt;gsamu-hpc-demo-10-241-0-137&gt; &lt;gsamu-hpc-demo-                     10-241-0-137&gt; &lt;gsamu-hpc-demo-10-241-0-137&gt; &lt;gsamu-hpc-dem                     o-10-241-0-137&gt; &lt;gsamu-hpc-demo-10-241-0-137&gt; &lt;gsamu-hpc-d                     emo-10-241-0-137&gt;, Effective RES_REQ &lt;select[(docker) &amp;&amp; (                     type == any)] order[r15s:pg] span[hosts=1] &gt;;Tue Jul 30 16:55:23: Starting (Pid 18426);Tue Jul 30 16:55:24: Running with execution home &lt;/home/lsfadmin&gt;, Execution CW                     D &lt;/mnt/lsf/repository-path/lsfadmin/myOpenFoam_run_172235                     8195200AuWDY/motorBike&gt;, Execution Pid &lt;18426&gt;;Tue Jul 30 17:04:01: Done successfully. 
The CPU time used is 1535.1 seconds;Tue Jul 30 17:04:02: Post job process done successfully;MEMORY USAGE:MAX MEM: 982 Mbytes;  AVG MEM: 431 Mbytes; MEM Efficiency: 0.00%CPU USAGE:CPU PEAK: 5.92 ;  CPU PEAK DURATION: 63 second(s)CPU AVERAGE EFFICIENCY: 50.67% ;  CPU PEAK EFFICIENCY: 98.68%Summary of time in seconds spent in various states by  Tue Jul 30 17:04:02  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL  328      0        518      0        0        0        846         </code></pre></div><p><strong>Conclusion</strong></p><p>The <em>user_data.sh</em> script of LSF resource connector allows a high degree of customization for cloud compute resources that dynamically join the LSF cluster. We’ve demonstrated how it can be used to tag cloud compute resources with a specific LSF Boolean resource in addition to the ability to install specific packages and do configuration customization. This is a simplified example, but illustrates this point.</p>",
            "url": "https://hpc.social/personal-blog/2024/advanced-lsf-resource-connector-configuration-on-ibm-cloud-part-iii/",
            
            
            
            
            
            "date_published": "2024-08-01T13:08:20-06:00",
            "date_modified": "2024-08-01T13:08:20-06:00",
            
                "author": "Ramblings of a supercomputing enthusiast."
            
        },
    
        {
            "id": "https://hpc.social/personal-blog/2024/sustain-is-a-verb-code-is-sustained-or-not-sustained-not-sustainable/",
            "title": "Sustain is a Verb- Code is Sustained or Not Sustained, not 'Sustainable'",
            "summary": null,
            "content_text": "(Note: This post is adapted from #182 of the Research Computing Teams Newsletter)So you, reader, will already understand that software is not “sustainable”.  There’s no sustainability linter you can run over the code to highlight possible sustainability issues, no test suite you can run to check for sustainability regressions.   Sustainability is not an inherent property of a piece of software.Same with a computing system, or a curated database, or..Instead, these efforts are sustained, or not, by people or organizations who pay for it to be sustained.Those people or organizations do this sustaining because they (or a community they support) need that software or other effort to do their jobs.  Because there are users who advocate for sustaining the effort, or sustain it themselves.In other words, sustaining is something a community does, and to the extent that “sustainability” in this sense is a thing that exists at all, one enhances an efforts’ “sustainability” by nurturing and supporting that community, and making it easy for them to effectively advocate for continued sustenance.Even so, the need the community has for that effort is going to wax and wane over time, as will the sustaining.   Eventually, at some point, the community will move on or dissolve entirely, and the sustaining will come to an end.So a tool goes from a prototype or something bespoke for one problem, and grows in technological readiness (#91) to become RCD development, not just research (#119), and over time gathers a community which, with luck, will sustain the effort for as long as the community exists and needs it.It may not find or build such a community - in startup speak it may not find “product-market fit”, and fade away (as with Sochat’s article on updated software in #172).  
That’s disappointing for the individuals involved, but it’s very much the nature of research - not every idea or effort pans out.This understanding of sustaining vs sustainability is finally starting to gain wider acceptance, which heartens me.  We need to understand that passively hoping that by checking off a list of criteria we’ve proved ourselves worthy, and that therefore sustaining funding will somehow happen, is to let down our community and our users.   We have to actively create the communities that will sustain the work we start.",
            "content_html": "<p>(Note: This post is adapted from <a href=\"https://www.researchcomputingteams.org/newsletter_issues/0182\">#182</a> of the <a href=\"https://www.researchcomputingteams.org\">Research Computing Teams Newsletter</a>)</p><p>So you, reader, will already understand that software is not “sustainable”.  There’s no sustainability linter you can run over the code to highlight possible sustainability issues, no test suite you can run to check for sustainability regressions.   Sustainability is not an inherent property of a piece of software.</p><p>Same with a computing system, or a curated database, or..</p><p>Instead, these efforts are sustained, or not, by people or organizations who pay for it to be sustained.</p><p>Those people or organizations do this sustaining because they (or a community they support) need that software or other effort to do their jobs.  Because there are users who advocate for sustaining the effort, or sustain it themselves.</p><p>In other words, sustaining is something a <strong>community</strong> does, and to the extent that “sustainability” in this sense is a thing that exists at all, one enhances an efforts’ “sustainability” by nurturing and supporting that community, and making it easy for them to effectively advocate for continued sustenance.</p><p>Even so, the need the community has for that effort is going to wax and wane over time, as will the sustaining.   
Eventually, at some point, the community will move on or dissolve entirely, and the sustaining will come to an end.</p><p>So a tool goes from a prototype or something bespoke for one problem, and grows in <a href=\"https://www.researchcomputingteams.org/newsletter_issues/0091\">technological readiness</a> (#91) to become <a href=\"https://www.researchcomputingteams.org/newsletter_issues/0119\">RCD development, not just research</a> (#119), and over time gathers a community which, with luck, will sustain the effort for as long as the community exists and needs it.</p><p>It may not find or build such a community - in startup speak it may not find “product-market fit”, and fade away (as with Sochat’s article on updated software in #<a href=\"https://www.researchcomputingteams.org/newsletter_issues/0172\">172</a>).  That’s disappointing for the individuals involved, but it’s very much the nature of research - not every idea or effort pans out.</p><p>This understanding of sustaining vs sustainability is finally <em>starting</em> to gain wider acceptance, which heartens me.  We need to understand that passively hoping that by checking off a list of criteria we’ve proved ourselves worthy, and that therefore sustaining funding will somehow happen, is to let down our community and our users.   We have to actively create the communities that will sustain the work we start.</p>",
            "url": "https://hpc.social/personal-blog/2024/sustain-is-a-verb-code-is-sustained-or-not-sustained-not-sustainable/",
            
            
            
            
            
            "date_published": "2024-06-02T00:00:00-06:00",
            "date_modified": "2024-06-02T00:00:00-06:00",
            
                "author": "Jonathan Dursi's Blog"
            
        },
    
        {
            "id": "https://hpc.social/personal-blog/2024/isc-24-recap/",
            "title": "ISC’24 recap",
            "summary": null,
            "content_text": "I had the great pleasure of attending the ISC High Performance conference this month, marking the fifth time I've attended what has become one of my top must-attend industry conferences of the year. This year was particularly meaningful to me because it is the first time that:I attended ISC as a Microsoft employee. This is also the first time I've attended any HPC conference since I changed my focus from storage into AI infrastructure.I attended ISC in-person since before the pandemic. It's also the first time I've visited Hamburg which turned out to be an absolute delight.Although registrations have been lower since the pandemic, this year's final registration count was over 3,400 attendees, and there was no shortage of old and new colleagues to bump into walking between the sessions at the beautiful Congress Center Hamburg.&lt;p&gt;This year’s theme was “Reinvent HPC,” and that idea—that HPC needs to reinvent itself—was pervasive throughout the program. The whole industry had been pulling towards exascale for the better part of a decade, and now that there are two exaflop systems on Top500 and the dust is settling, it feels like everyone is struggling to figure out what’s next. Is it quantum? AI?&lt;/p&gt;It was difficult for me to draw a line through all the topics worth reviewing at this year's ISC, as it was a very dense four days packed with a variety of topics, discussions, vendors, and events. 
I only experienced a fraction of everything there was to be seen since so many interesting sessions overlapped, but I thought it might be worthwhile to share my perspective of the conference and encourage others to do the same.Table of ContentsReinventing HPC (and blast those hyperscalers!)Kathy Yelick's opening keynoteClosing keynotes on the futureTop500 and Aurora#1 - Frontier#2 - Aurora#3 - EagleOther notable tidbitsEveryone is an AI expert!The Exascale AI Synergies LLM Workflows BOFAI Systems for Science and ZettascaleReal applications of generative AI for scienceHigh Performance Software FoundationQuantum computingReinvent HPC to include urgent computing?The Urgent Computing focus sessionThe Interactive and Urgent HPC workshopConcluding thoughtsReinventing HPC (and blast those hyperscalers!)The need to reinvent HPC was the prevailing theme of the conference from the very first session; with the listing of Aurora as the second system on Top500 to break the 1 exaflops barrier, the community is in search of a new milestone to drive research (and funding!). At the same time, commercial AI has rapidly risen up largely in an independent, parallel effort with a speed and scale that begs the question: how important was the decade-long drive to break the exaflops barrier if the AI industry could catch up so quickly without the help of the institutions that have historically posted the top HPL scores? 
If the commercial AI industry overtakes scientific computing as the world leader in deploying at scale, how can “HPC” be reinvented so it can continue to claim leadership in another dimension?Kathy Yelick's opening keynoteISC’s opening keynote was given by Kathy Yelick, where she provided commentary on two recent government-commissioned reports on the future of HPC:Charting a Path in a Shifting Technical and Geopolitical Landscape: Post-Exascale Computing for the National Nuclear Security Administration, commissioned by the National AcademiesCan the United States Maintain Its Leadership in High-Performance Computing?, commissioned by the US Department of Energy’s Advanced Scientific Computing Research programLiving up to her reputation, Dr. Yelick’s talk was fast and insightful, describing the insatiable demand for computing driven by scientific research, the struggle to expose continuing amounts of parallelism to make use of newer processors, and some promising directions to address that disconnect. However, her talk started in a direction that I didn’t like when she went into describing the disruptors that necessitate reinventing HPC:The above slide implied that AI, quantum, or cloud may pose an existential threat to the HPC community gathered at ISC this year; this immediately raised my hackles, as it cast the relationship between “HPC” and “AI”/“cloud” as having some sort of adversarial tension. As the talk went on, I realized that “HPC” didn’t really mean “high-performance computing” to her. Rather, it was used to refer to something much more narrowly scoped—high-performance computing to solve scientific problems. Slide after slide, the presentation kept doubling down on this idea that “HPC” as the audience knows it is being threatened. 
For example, Yelick talked through this slide:The picture she painted is that “HPC” (denoted by companies with blue bars) no longer has influence over technology providers because the “hyperscalers” (green bars) have such an outsized amount of investment. She then used this to call on the audience to think about ways “we” could influence “them” to produce technologies that are useful for both scientific computing and low-precision AI workloads.Her talk culminated in this slide:Which was accompanied by this conclusion:\"So what’s a post-exascale strategic for the scientific community? It's the beat 'em or join 'em strategy. The beat 'em strategy says we’re going to design our own processors. [...] The join 'em strategy says let's leverage the AI hardware that's out there. [...] The sort of sneaky way of doing this is getting embedded in the AI community and trying to convince them that in order to make AI better for commercial AI applications, you really want to have certain features. Like don't throw away your 64-bit arithmetic and things like that.\"I found myself getting increasingly unsettled through the keynote, because this \"us versus them\" mentality put me, a long-standing member of this HPC community, in the camp of \"them.\" It was as if I was suddenly an outsider in a conference that I've been attending for years just because I no longer work for an organization that has been doing HPC since the early days of computing. Even though the clusters I support use the same NVIDIA and AMD GPUs, the same InfiniBand fabrics, and the same Lustre file systems that \"HPC\" uses, I am no longer in \"HPC\" because I am \"hyperscale\" or \"cloud\" or \"AI.\"The underlying message is one I get; GPUs are trending in a direction that favors massive gains in lower-precision computation over FP64 performance. And the cost of HBM is driving the overall value (in FP64 FLOPS per dollar) of accelerators backwards for the first time in the history of scientific computing. 
But the thesis that the scientific computing community needs to be sneaky to influence the hyperscale or AI players seemed way off the mark to me. What seemed absent was the recognition that many of the \"hyperscalers\" are her former coworkers and remain her colleagues, and \"they\" sit in the same audiences at the same conferences and share the same stages as the \"HPC\" community. All that is true because \"HPC\" is not somehow different than \"cloud\" or \"AI\" or \"hyperscale.\" If there really is a desire to influence the hyperscale and AI industry, the first step should be to internalize that there is no \"us\" and \"them.\"Closing keynotes on the futureJust as the conference was opened with a talk about this \"us versus them\" mentality, it was closed with a talk about \"us versus them\" in a keynote session titled, \"Reinventing HPC with Specialized Architectures and New Applications Workflows\" which had two speakers followed by Q&amp;A.Chiplets for modular HPCJohn Shalf gave one half of the closing keynote, where he gave his usual rally for investments in chiplets and specialized processors for HPC:He gives a variant of this talk at every ISC, but this year he lasered in on this notion that the \"HPC\" community needs to do what the \"hyperscalers\" do and use chiplets to develop custom ASICs. It was an energetic and impassioned talk, but this notion that hyperscalers are already executing on his idea for the future sounded a little funny to me seeing as how I now work for one of these hyperscalers and his message didn't resonate.If you really follow the money, as Shalf suggested, a huge amount of it is flowing into GPUs, not specialized processors. 
It wasn't clear to me what specialization he was thinking of when he referred to custom silicon being developed by the likes of Meta, Google, AWS, and Microsoft; it's true that these companies are developing their own silicon, but those efforts are largely addressing cost, risk, and supply, not improving performance beyond more general-purpose silicon like GPUs. And it turns out that a significant fraction of the (non-US) HPC community is already developing custom silicon for the same reasons as the hyperscalers; Japan, China, and Europe are all developing their own indigenous processors or accelerators for scientific computing at leadership scales. In that sense, Shalf was preaching to the choir given that, on the international stage, his government is the odd one out of the custom silicon game.He also suggested a dichotomy where the HPC community would either have to just (1) make every scientific problem an AI problem or (2) join this journey towards making domain-specific accelerators, ignoring the significant, unexplored runway offered by using mixed precision arithmetic in scientific applications. He called for partnering with hyperscalers, but his examples of implementing a RISC-V-based stencil accelerator and a SambaNova-based DFT processor didn't draw a clear line to the core missions of the large hyperscalers he extolled. He briefly said that partnering would benefit hyperscalers by addressing some capital cost challenges, but seeing as how the annual capital expenditures of the hyperscalers outstrips those of the US national HPC effort by orders of magnitude, I couldn't understand what the hyperscalers would stand to gain by partnering in this way.Integrating HPC, AI, and workflowsRosa Badia gave the second half of the closing keynote where she proposed ideas around complex scientific workflows and the novel requirements to support them. 
This talk felt a lot more familiar, as the focus was squarely on solving scientific computing challenges by connecting traditional HPC resources together in nontraditional ways using software whose focus goes beyond cranking out floating point arithmetic.As she spoke, I couldn't help but see parallels between the challenges she presented and the sort of technologies we live and breathe every day in cloud services.  For example, she showed this slide:Dr. Badia obviously wanted to make a cloud-tie in by calling this \"HPC Workflows as a Service,\" but what I'm not sure she realized is that this model almost exactly describes platform-as-a-service frameworks that already exist in commercial clouds. For example,What she calls a \"Data Catalog\" is a public or private object storage account (a blob container, an S3 bucket) or a PaaS abstraction built atop themWhat she calls a \"Software Catalog\" is a container registry (Azure Container Registry, Amazon Elastic Container Registry) or an abstraction built atop themA \"Workflow Description\" is something like an AzureML pipeline or SageMaker pipelineA \"Workflow Registry\" is just a Github repository containing pipelinesThe \"Portal\" is the web UI provided by AzureML or SageMakerI don't think there's anything truly new here; the challenges she described lie in wedging these workflows into HPC infrastructure which lacks the platform features like robust identity and access management (i.e., something better than LDAP that supports more modern authentication and authorization flows and finer-grained access controls) and data management (i.e., something better than a parallel file system that depends on POSIX users, groups, and permissions and implicit trust of clients).She went on to describe a workflow data management system that reinvented a bunch of infrastructure that is already baked into commercial cloud object stores like Azure Blob and AWS S3:As she was describing the requirements for such a workflow data 
management layer, it struck me that what the scientific data community calls \"FAIR principles\" are the same basic requirements for operating in commercial environments where data may be subject to strict privacy and compliance regulations. The notion of findable data may be aspirational for scientific datasets, but when a company is having to find datasets because it's being sued or subpoenaed, findability is a bare-minimum requirement for any data management system. Similarly, tracking the provenance of data may be a nice-to-have for scientific data, but it is a hard requirement when establishing a secure software supply chain. Cloud storage systems solved many of these challenges a long time ago, and I can't help but wonder if this idea that workflows in HPC pose a new set of challenges is another manifestation of \"us\" not realizing \"they\" might have done something useful and applicable for science.Badia's final slide had a particularly poignant statement which read, \"Systems can only be justified if we have applications that need them.\" I think she was trying to call for more investment in application development to exploit new systems, but I think the inverse is also true. If modern scientific applications truly require more complex orchestration of compute and data, maybe the scientific computing community should stop building computing platforms that make it really difficult to integrate different systems.Again, \"HPC\" is not the opposite of \"cloud;\" it's not an either/or decision. There are technologies and tools that were designed from the beginning to simplify the secure connection of services and resources; they just weren't invented by the HPC community.Top500 and AuroraOne of the cornerstones of ISC is the semiannual release of the Top500 list, and unlike at SC, the Top500 announcements and awards do not overlap with any other sessions, so it tends to have a higher profile and draw all attendees. 
This go-around, there were no dramatic changes in the Top 10; the new Alps system at CSCS was the only new entry, and the order of the top five systems remained the same. Notably, though, Aurora posted a significantly higher score than at SC'23 and broke through the exaflops barrier using 87% of the system, cementing its place as the second exascale system listed. But let's start at the top.#1 - FrontierFrontier at Oak Ridge remained #1, but it squeezed twelve more petaflops out of the same node count and is now just over 1.2 EF. Nothing groundbreaking, but it's clear evidence that ORNL is continuing to tune the performance of Frontier at full system scale.#2 - AuroraAurora, on the other hand, finally eked over the exaflops line with 1.012 EF using 87% of the system's total 63,744 GPUs. Rick Stevens gave a short talk about the achievement which is summed up on this slide:I was a little surprised by how honest Stevens was in this talk; the typical game that is played is that you stand up on stage, talk about how great of a partnership you had with your partners to realize this achievement, extol the virtues of the technologies on which your system was built, and talk about how this HPL score is just the start of a lot of great science.Stevens didn't do that though.He started out by telling the conference that Intel had bad product names, then explained that their low Graph500 and HPCG scores were the result of their exclusive focus on breaking the exaflops barrier with HPL, implying they didn't have time or ability to run Graph500 or HPCG at the same 87%-89% scale as their HPL and HPL-MxP runs. Based on this, it sounds like Aurora is still a ways away from being stable at scale, and we're unlikely to see any Gordon Bell-nominated papers at SC'24 this November.After this session, folks seemed to relish in dunking on Aurora; its window to be #1 is likely to have closed and it has some power efficiency issues. 
But I don't think anyone involved with the Aurora project needs to be told that; if what Stevens implied is true, the folks at ALCF, Intel, and HPE have been struggling for a long time now, and topping out over 10^18 was a hard-sought, major milestone to be celebrated. The Aurora project has been thrown more curveballs than I would have ever guessed a single HPC project could have, so all parties deserve credit for sticking it through all this way rather than just walking away. With any luck, Aurora will stabilize in the next six months, and we'll see full-scale runs of Top500, Graph500, HPCG, and science apps by November.#3 - EagleThe third highest system on the list was Eagle, whose HPL score has not been updated since the system was first listed at SC'23 last year. Through a few twists of fate, I wound up being the person who accepted the award on-stage, and I now have a Top500 award for the #3 system sitting in my home office. Here's a photo of me goofing around with it:It's not entirely inappropriate that I was the one to accept it since my teammates are the ones carrying pagers for the on-call rotation of that system, and we were also the hands-on-keyboard when that HPL run was conducted. Still, it was a bit surreal to walk on-stage to pick up such a noteworthy award immediately following two actually important people (both of whom have \"director\" in their titles) accepting the same award. By comparison, most of my career highlights to date have been just trolling HPC people on Twitter (as the esteemed Horst Simon actually said out loud as I was leaving the stage!)It was weird.That said, I take this to mean that it is now my duty to be the friendly face from Microsoft who can speak intelligently about the #3 system on Top500. To that end, I'll answer some questions that I was asked at ISC about the system and Azure HPC clusters in general below. None of this is new or secret information!Why didn't you run HPL again and post a higher score to beat Aurora? 
Because the day after that HPL run completed, that system was put into production. Once systems are in production, people are paying to use them, and taking a time-out to re-run HPL costs a ton of money in either real dollars (if a customer runs it) or lost revenue (if the HPL run is blocking customer workloads). This is quite different from public-sector HPC systems which never have to pay for themselves.Can I get access to Eagle for a Gordon Bell run or to test software? That's not really how it works. Whereas a traditional supercomputer might allow users to ssh in and submit jobs to a Slurm queue, cloud-based supercomputers allow users to deploy virtual machines through a REST API. Those virtual machines can allow ssh, run Slurm, and support MPI jobs like HPL, but that OS environment is managed by Azure users, not Azure itself. You can get a taste of what's required to run a basic MPI job by reading some instructions I wrote on provisioning an MPI cluster on Azure.Is it just a bunch of GPU nodes scattered around a bunch of data centers? No, all the nodes on any given Azure HPC cluster (like Eagle) share an InfiniBand fabric. There are countless InfiniBand clusters in Azure, but each one is a real supercomputer by any definition of a supercomputer, and they are designed to run tightly coupled jobs across all their GPUs.What parallel file system does it use? Don't think about it that way. You can provision a Lustre file system and mount that to any or all cluster nodes if you want to, or you can access data directly from object storage.Are there any photos of it? You can see a photo of one of the Microsoft-designed nodes that comprise the system on my SC'23 recap blog post. Beyond that, there's not much to look at because Azure HPC clusters are not meant to be photogenic like, say, Cray supercomputers. There are no rack graphics (or even rack doors!). It's just tons and tons of air-cooled racks with InfiniBand optics coming out of each one. 
Maybe the only unique thing is that the racks are painted white instead of the typical black. Not sure why.Getting back to that false separation between \"HPC\" and \"cloud,\" Eagle is strong evidence that they aren't different. What the \"hyperscalers\" do is not that different from what traditional HPC centers do. Perhaps the biggest difference is that cloud supercomputers get all the benefits of cloud infrastructure: software-defined infrastructure such as virtual machines and virtual networking, integration with identity and access management that transcends simple Linux UIDs/GIDs, and the flexibility to integrate with whatever storage systems or ancillary services you want from any compute node.Other notable tidbitsIt is tradition for Erich Strohmaier to talk through some highlights and trends of the latest Top500 list every time a new one is announced, and in the past, I've been critical of how he's presented conclusions from the list with this implicit assumption that computers that never post to Top500 simply don't exist. This year felt different, because Dr. Strohmaier made the explicit statement that China has completely stopped submitting to Top500. Their exascale systems aren't listed, but neither are any new systems in the past three years at the bottom. They simply don't play the game anymore, making it undeniable that Top500 is no longer an authoritative list.Just as the whole conference's theme was reinventing HPC, I felt a sense that even the most stalwart proponents of Top500 are now recognizing the need to reinvent the Top500 list. Kathy Yelick said as much during her keynote (\"Shall we replace Top500? What are the metrics in post-exascale computing that are important?\"), and Erich implored the audience to help expand the HPL-MxP (formerly HPL-AI; an HPL-like benchmark that can use the mixed-precision capabilities of tensor cores) list. 
Nobody seems to know how to quantify what makes a leadership supercomputer nowadays, but accepting that HPL scores (or appearing on the Top500 list!) won't cut it is a good first step.That all said, Top500 is still a valuable way to track technology trends in the industry. For example, this edition of the list is where NVIDIA's new Grace-Hopper node started appearing in force. The only new entrant in the Top 10 was the 270 PF GH200 component of CSCS's Alps system, and HPE had these EX254n GH200 blades on display on the show floor.To HPE/Cray's credit, they seem to have gotten the system up and running with Slingshot without the delays that plagued early Cray EX systems like Frontier and Aurora. Hopefully this is a sign that the Cray EX platform and Slingshot-11 have graduated from being risky and not-quite-production-ready.The other notable entrants on this year's Top500 are a trio of early MI300A APU-based Cray systems being built around the El Capitan program at Lawrence Livermore National Laboratory. This is a positive sign that MI300A is up and running at modest scale, and HPE also had one of these EX255a blades on display at their booth:The strong showing of MI300A suggests that we may see El Capitan take the top spot in the next edition of the Top500 list coming in November.Everyone is an AI expert!Since I now work on a team responsible for AI infrastructure, I tried attending as many of the AI-focused talks and panels as I could this year. Unsurprisingly, these sessions largely carried the same undertones of \"reinventing HPC,\" and speakers opined on how AI would affect scientific computing and offered examples of what their institutions were doing to extend their leadership in the HPC space into the AI space. 
There was a fair amount of grasping going on (as there always is when AI is discussed at non-AI conferences), but this year I was struck by how confused so many speakers and attendees were about concepts related to applying AI.To be clear: I am no expert in AI. However, my day job requires that I be steeped in some of the largest AI training workloads on the largest AI supercomputers on the planet, and I have to have a cursory understanding of the latest model architectures and techniques to anticipate how future system designs will have to evolve. It's from this perspective that I made the following observation: there are a lot of HPC people speaking very confidently about AI based on an outdated understanding of the state of the art. The AI industry generally moves much faster than the government-funded research community, and I couldn't help but wonder if some community leaders assumed that the AI industry today is the same as it was the last time they wrote their AI grant proposal.Of course, there were also some really insightful perspectives on AI for science shared as well. Let's talk through some examples of both.The Exascale AI Synergies LLM Workflows BOFThis realization that the ISC community is not keeping up with the AI community first slapped me in the face when I ducked into a BOF session titled, \"Tales of Exascales – AI and HPC Supercomputing Platforms Synergies for Large Language Models (LLMs) and Scientific Workflows.\" I sometimes wonder if the organizers who propose titles like that are intentionally creating word salad, but in this case, it was an apt session name; the discourse around HPC and AI was all over the board throughout the hour.The session started on a strong, positive note by Simon McIntosh-Smith describing Bristol's new Isambard-AI system, a GH200-based Cray supercomputer funded under the broad charge of \"AI research.\" While I'm usually skeptical of such nebulously defined \"AI research\" machines, Dr. 
McIntosh-Smith's description of the project quickly checked a bunch of boxes on how a real AI research platform should be developed. In particular,Isambard-AI was developed and deployed at the pace of AI rather than HPC for scientific computing. Whereas government-funded, large-scale HPC systems typically take years to procure, Simon said that the first discussions started in August 2023, and in the nine months that followed, they had built the site, the team, and the system itself to the degree that a piece of the final system is already on Top500. By comparison, LLNL's El Capitan supercomputer also debuted on Top500 this month, but its contract was signed five years ago, and its procurement began at least two years before that. The AI industry would not exist if the systems it trains on took seven years to procure.Isambard-AI deliberately avoided exotic AI accelerators to remain future-proof. Simon rightly pointed out that the AI industry moves too quickly to anticipate whether a bespoke AI accelerator would even be relevant to whatever the hottest model architecture will be in a year. GPUs were chosen because they are the most flexible way to accelerate the widest range of AI workloads, regardless of whether they involve dense models, sparse models, inferencing, or training, and at whatever level of quantization makes sense. The reality is that cutting-edge research is done on GPUs, so aligning an AI supercomputer on the same technology will ensure that the algorithms developed by industry are immediately usable for scientific research.A reasonable definition of \"AI for science\" was established from the outset. Rather than blurting out \"we need to research AI!\" and asking for a sack of money to buy GPUs, Simon outlined a vision of training AI models using data generated by physical simulation on a more conventional HPC system. 
Training models on models to create surrogate models is not particularly new, but it does establish a few reasonable architectural decisions such as having a robust data management and sharing platform, close coupling to the HPC system performing simulation, and aligning software stacks and programming environments as closely as possible.Simon's contribution to the discussion stood out to me as the most impressive, and the discourse that followed seemed to fall into a trap of familiarity. Rather than focusing on the new and exciting prospects of AI, some panelists and audience members wanted to focus on the aspects of AI they understood. For example, an uncomfortable time was spent on a back-and-forth on how HPC centers can support Kubernetes and random I/O (which is what defines AI vs. HPC?) instead of Slurm and Lustre. If your biggest challenge in delivering infrastructure to support AI workloads is figuring out how to deploy both Kubernetes and Slurm, you haven’t even reached the starting line. This is a trivial issue in cloud environments, where entire AI clusters can be built up and torn down in minutes. Again, this is evidence that the scientific computing community isn’t ready to keep pace with the AI industry.I jotted down a few of the questions and comments that I heard during this BOF that seem to reflect the level of familiarity the average ISC attendee has with AI:\"Would be nice if there were more models for science.\" I wasn't sure what this meant. 
All the leading LLMs are pretty good at \"science,\" and domain-specific models aren't readily transferable between different science domains or problems.Scientific problems \"have to validate outputs for correctness, unlike LLMs.\" I think the speaker was making a sidelong reference to hallucinations, but like with any model (large language or physics-based), validating outputs for correctness is certainly necessary and readily possible.\"The demands of inference of LLMs are completely different from those for training. How do you buy inference infrastructure?\" I wonder where this notion came from. If your infrastructure can train a model, it can definitely inference that model. Cost-optimizing infrastructure for inferencing is a separate matter (you can cut corners for inferencing that you wouldn't want to cut for training), as is building the service infrastructure around inferencing to deliver inferencing as a service. But I don't think that's what this question was about.\"Working safely with sensitive data / isolating workloads on big shared clusters.\" This is a problem that arises only when you try to wedge AI workloads into infrastructure designed for traditional physics-based simulation. If you have sensitive data, don't use big shared clusters. Provision separate clusters for each security domain on a shared, zero-trust infrastructure.\"How different are the files and filesystem access while training for LLMs, image generation models, reinforcement learning?\" This question reflects a general misunderstanding of data and storage in HPC overall; how data is organized into files and how that data is accessed by a workload is an arbitrary decision made by the application developer. 
You can organize piles of text into one giant file or a million little files.There were a few questions that came up that touched on deeper issues on which the HPC community should reflect:\"What are the first steps for scientific groups wanting to get ready for using AI in the future?\" This is probably the purest question raised in the entire session, and I think this is something the scientific computing community as a whole needs to figure out. What does \"using AI\" really mean for scientific groups? Is it training models? Fine-tuning models? Inferencing using pre-trained models on HPC infrastructure? Is it integrating simulation applications with separately managed inferencing services? Who manages those inferencing services? Does inferencing even require HPC resources, or can suitable models run on a few CPU cores? I think the first step to answering this question is ensuring that the scientific computing community reaches a common baseline level of understanding of what \"using AI\" means. And a lot of that probably means ignoring what some self-professed AI experts in the HPC community claim is the future.\"Care to predict what that ChatGPT moment will be for AI for Science? Had it already happened?\" This question was addressed directly by panelist Séverine Habert who rightly pointed out that the ChatGPT moment occurred when a complex and esoteric topic was suddenly put in the hands of hundreds of millions of laypeople across the world. It was the moment that the common person walking on the street could suddenly interact with the most cutting-edge technology that had been previously understandable only to the headiest of researchers in industry and academia. 
That will likely never happen in AI for science because science, by definition, requires a higher baseline of education and understanding than the average layperson has.\"How to effectively train the existing workforce when we are already struggling to retain talent in research/academia?\" This question strikes at the same theme that Kathy Yelick's opening keynote confronted: what is the role of the scientific computing community now that it turns out that you don't need decades of institutional experience to deploy and use HPC resources at leadership scale? As offensive as it may sound, perhaps the public-sector HPC community should accept that their role is not training future researchers and academics, but training future practitioners of AI in industry. This is how the wider tech industry generally works; neither startups nor tech giants make hires assuming those people will still be around in ten years. Why does the public-sector HPC industry think otherwise?Finally, I was also struck by how fiercely the discourse clung to the idea that large language models are the answer to all AI problems in science. I get that this panel was focused on exascale, and LLM training is one of the rare cases where AI requires exascale computing capabilities. But there was no acknowledgment that trillion-parameter models are not actually a good idea for most scientific applications.AI Systems for Science and ZettascaleThis singular focus on creating massive LLMs for science was front-and-center in a talk given by Rick Stevens titled \"The Decade Ahead: Building Frontier AI Systems for Science and the Path to Zettascale.\" The overall thesis that I heard was something like...Science needs its own trillion-parameter foundation modelsTraining trillion-parameter foundation models requires a lot of GPUsWe need $25 billion from the U.S. 
governmentHowever, Stevens never answered a very basic question: what does a foundation model for science do that any other foundation model cannot do?He showed slides like this which really don't sound like foundation models for science as much as generic AI assistants:Is the scientific computing HPC community really the most qualified bunch to reinvent what existing foundation models like GPT-4 or Claude 3 have already done? Even if you argue that these proprietary models aren't as good at \"science\" as they could be, who would have a better chance of addressing this with a billion dollars of federal funding: the companies who developed GPT or Claude, or a collection of government scientists starting from scratch?I think the answer to this question was in other parts of Stevens' talk. For example, he started with this slide:While robust requirements are good when there's no urgency, this slide is also a tacit admission that the government takes years to generate a perspective on AI. Do you think the creators of Llama-3 or Mistral Large gathered wide community input from over 1,300 researchers before deciding to build a supercomputer and train a model? Even if science needs its own foundation models, this slide is strong evidence that, by the time the scientific HPC community agrees on a path forward, that path will be years out of date relative to what the commercial AI industry is doing.A great example of this already happening is the basic premise that creating a foundation model with a trillion parameters is the best way to apply AI to solve science problems. This certainly was the leading thought two years ago, when transformer scaling laws were published that suggested that the best way to get better-performing LLMs was to simply add more parameters to your transformer and train on more data. But there's a reason all the leading models have stopped advertising how many parameters they use.Dealing with massive transformers is really expensive. 
They're not only really expensive to train, but they're really expensive to use for inferencing too. This has led to a bunch of innovation to develop model architectures and approaches to training that result in dramatically higher quality outputs from a fixed parameter count. Dense transformer architectures with a trillion parameters have become the blunt instrument in developing foundation models since 2022, so it took me by surprise to hear Stevens put so much stock into this notion that the need for a trillion-parameter model is essential for science.To repeat myself, I am no expert in AI. I've never been called in front of Congress to talk about AI or been invited to give talks on the topic at ISC. There might be something basic that I am missing here. But when I look at the science drivers for AI:I know that you do not need to train your own trillion-parameter model to do most of this stuff. Even the use cases that do require generative AI, like code generation and math theory, don't actually require trillions of parameters. Small language models, such as that described in Textbooks Are All You Need (published in 2023, after the reports Stevens cited in his talk), can produce amazing results with very small models when you train them using high-quality data instead of garbage from Reddit. And when you create or fine-tune a small language model for a specific science domain, not only do you save yourself from having to buy a billion-dollar supercomputer for training, but you get a model that is much more accessible to scientists around the world because they won't need a million dollars' worth of GPUs to inference with it.So, if there's one question that was never answered across any of the AI-themed sessions at ISC this year, it is this: Why does science need to train its own large language models? 
My intuition is that either fine-tuning existing large language models or training small language models for domain-specific applications would be a better investment in actually advancing science. However, if we cynically assume the real goal of LLMs-for-science is to justify buying massive GPU systems, suddenly a lot of the talks given at ISC on this topic make a lot more sense.Real applications of generative AI for scienceAs frustrated as I got sitting through sessions on AI where it sometimes felt like the blind leading the blind, there was one really good session on actual applications of generative AI for science.Mohamed Wahib of RIKEN gave an insightful presentation on the unique challenges of using generative AI in science. His summary slide touched on a lot of the key challenges:And his actual talk focused largely on the model and data aspects of generative AI. What struck me is that the challenges he described reflected the experience of someone who has actually tried to do what many other AI experts at the conference were claiming would be the future. For example,He recognized the importance of training scientific models with high-quality datasets, not just garbage scraped off of social media. This means not only scraping or generating high quality data for training, but curating and attributing that data and applying reinforcement learning with human feedback as the model is being trained. This is uniquely challenging when creating models for scientific applications, as managing the quality of scientific data requires deep domain expertise. This contrasts with a generic chat bot whose inputs and outputs can often be assessed by anyone with a basic education.He also talked about the tendency of scientific data to be highly multimodal and multidimensional. 
Whereas multimodal chatbots may combine text and vision, scientific data often contains observations of the same phenomenon from many different sensors (for example, pressure, temperature, density, strain fields, ...), and the output of a generative model for science may require multiple modalities as well. These capabilities are not well developed in LLMs designed for human language.Dr. Wahib also pointed out that scientific datasets tend to be huge compared to text and images, and this may require developing ways for models to have context windows that can fit multi-petabyte datasets' tokens to identify long-range correlations. Relatedly, he also pointed out that tokenization of scientific data is a new set of challenges unique to this community, since industry has been focused on tokenizing low-dimensional data such as text, audio, and images.The good news is that industry's quest towards both commercializing generative AI and achieving AGI will touch on some of these challenges soon. For example, training domain-specific models using high-quality datasets is an essential component of the small language models I described in the previous section, and these small language models are what will enable privacy-preserving and cost-effective generative AI on laptops and phones. Effectively infinite context windows are also a major hurdle on the path to AGI, as industry is hard at work developing AI agents that can remember every conversation you've ever had with them. Finding more scalable approaches to attention that do not sacrifice accuracy is a part of this.François Lanusse, currently at the Flatiron Institute, also gave a nice presentation that clearly explained how generative AI can be used to solve inverse problems, that is, figuring out the causes or conditions that resulted in a collection of measurements. 
A precise example he used applied generative AI to figure out what an image distorted by gravitational lensing might look like in the absence of those distortions. As I understood it, he trained a diffusion model to understand the relationship between images that are affected by gravitational lensing and the masses that cause lensing through simulation. He then used that model instead of an oversimplified Gaussian model as part of a larger method to solve the inverse problem of un-distorting the image.The details of exactly what he did were a little over my head, but the insight piece for me is that combining generative AI and science in practice is not as straightforward as asking ChatGPT what the undistorted version of a telescope image is. Rather, almost all of the standard, science-informed approach to solving the inverse problem remained the same; the role of generative AI was simply to replace an oversimplified part of the iterative process (the Annealed Hamiltonian Monte Carlo method) to help it converge on better answers. It really is a combination of simulation and AI, rather than an outright substitution or surrogate model.Dr. Lanusse also showed this slide which demonstrated how this approach can be generalized to other scientific domains:The general approach of pretraining, fine-tuning (\"adapt\"), and combining foundation models with other physics-based models seems reasonable, although I admit I have a difficult time wrapping my head around exactly how broadly scoped he envisions any given pretrained foundation model to be. I can see such a model trained on extensive sky survey data being useful for a number of astrophysical and cosmological tasks, but it's less clear to me how such a model might be useful in unrelated domains like, say, genomics.You might also ask why I think this vision of foundation models for science is reasonable while Rick Stevens' vision didn't ring true; the difference is in scale! 
The foundation models cited on Lanusse's slide are vision transformers which have many orders of magnitude fewer parameters than the trillion-parameter models that others talk about. Whereas a trillion-parameter model might need to be distributed over dozens of H100 GPUs just to produce one inference result, the largest of the vision transformers can probably be squeezed on to a single high-end desktop GPU. Again, you don't need billion-dollar supercomputers to train these models for science.Frank Noé from Microsoft Research then talked about how generative AI can be applied to solve problems in simulating biological systems. Like the talk before his, Dr. Noé followed this pattern where a larger, physics-based framework had one statistical technique replaced by a method based on generative AI, and then a physics-based model is used to quantify the likelihood that the result is reasonable. He contrasted this with conventional approaches (to, say, protein folding) where you just simulate for really long times in the hopes that your simulation randomly wanders into a situation where you capture a rare event.His talk wasn't about generative AI as much as the previous speakers, but he offered a litany of ways in which AI models can be useful to molecular modeling:Markov state models provide a statistical framework that lets you replace one long simulation (that hopefully captures every possible scenario) with a bunch of short, chopped-up simulations that hopefully capture every possible scenario in parallel. He cited an example that took 20,000 GPU-days on V100 GPUs that would've otherwise taken a million GPU-years if done in one long simulation.Coarse-grained models use machine learning to develop surrogate models to simulate the physics of relatively uninteresting parts of molecular systems. 
The example he used was simulating the water molecules surrounding a biomolecule; water can be very difficult to accurately model, and the example he cited led to a surrogate model that was 100x faster than directly simulating water molecules.Boltzmann generators can generate 3D molecular structures based on a known probability distribution defined by the energy states of the system. This is another fast way to find rare but stable molecular configurations without having to throw darts at a dartboard.What struck me is that, in all these cases, the AI model is never generating results that are blindly trusted. Instead, they generate molecular configurations which are then fed into physics-based models which can quantify how likely they are to be valid.Both Lanusse's and Noé's examples of combining AI and simulation painted a picture to me where generative AI can be really useful in solving problems where a researcher would otherwise have to make educated guesses about what physical phenomenon is really happening based on incomplete information. So long as there is a way to apply a physics-based model to check the accuracy of each guess, generative AI can be trained to predict the relationships between incomplete information and what's really going on and get to probable answers much faster than relying on physics alone.More broadly, I couldn't help but think about the Sora video showing pirate ships battling in a cup of coffee as I left this session. Like that video, these talks demonstrated that it's possible to train generative AI models to reproduce physical phenomena (like the fluid dynamics of coffee) without explicitly embedding any laws of physics (like the Navier-Stokes equations) into the model itself and still get really compelling results. 
The part of this that was lacking from the Sora video (but was present in these talks) was closing the loop between generated results and the laws of physics by feeding those generated results back into the laws of physics to figure out if they are probable.High Performance Software FoundationISC'24 wasn't all about AI though! I wound up attending the launch of the High Performance Software Foundation (HPSF), a new Linux Foundation effort spearheaded by Todd Gamblin and Christian Trott (from Livermore and Sandia, respectively) aimed at promoting the sustainability of the software packages relied upon within the high-performance computing community.I haven't paid close attention to HPC software in a long time since most of my work was in platform architecture and storage systems, so a lot of the background context remains a little murky to me. That said, it seems like HPSF was formed to be like the Cloud Native Computing Foundation for the HPC community in that:it will serve as a neutral home for software projects that aren't tied to any single university or government institutionit provides mechanisms to ensure that critical HPC software can continue to be maintained if its original author gets hit by a busit will help with the marketing and promotion of HPC softwareIts governance seems pretty reasonable, with different levels of membership being accompanied by different levels of rights and obligations: There is a Governing Board comprised of paying members (and predominantly those who pay the most), while the Technical Advisory Council carries out the more technical tasks of forming working groups and onboarding projects.There are three levels of membership, and the highest (premier) has a $175,000 per year buy-in and comes with a seat on the Governing Board. 
Right now, the founding seats are held by AWS, HPE, LLNL, and Sandia.Below that is a general membership tier whose cost is on a sliding scale based on the organization size, and AMD, Intel, NVIDIA, Kitware, ORNL, LANL, and Argonne have all committed at this level. The associate tier is below that, and it is free to nonprofits but comes with no voting rights.It seemed like the exact functions that HPSF will have beyond this governing structure are not fully baked yet, though there were six \"prospective\" working groups that provide a general scope of what the HPSF will be doing:My read of the description of these working groups is thatCI/testing will supply resources (GPUs) on which HPSF projects' code can be automatically tested.Software stacks will maintain E4S.User engagement sounds like it will figure out what users of HPSF projects' software are looking for. It sounds like this will provide some product management-like support for projects.Facility engagement is probably like user engagement, but for the sites deploying code on behalf of their users. Again, this sounds like product management functions.Security sounded like stewarding SBOM-like stuff for member projects' software.Benchmarking would make a framework for benchmarking HPC applications.That all said, it still wasn't clear what exactly HPSF would do; what would all those membership dues go towards supporting? Based on some Q&A during this BOF and follow-up afterwards, I pieced together the following:HPSF will not be funding developers, much in the same way that OpenSFS doesn't fund Lustre development. 
That said, Todd Gamblin later said that not funding software development was a financial constraint more than a policy one, with the implication that if more members join, there may be opportunity for HPSF to fund projects.HPSF likely will be hosting events and conferences (perhaps like the CNCF hosts KubeCon), providing scholarships, developing and providing training related to member projects, and \"increasing collaboration\" (whatever that may mean!).HPSF also has some influence and ownership over its member projects:HPSF will co-own its projects' GitHub repos to ensure continuity in case the other repo owner abandons it.HPSF will own the domain for the project for the same reasons as above.Member projects still manage their own software development, roadmaps, releases, and the like. The HPSF won't dictate the technical direction of projects.HPSF will own the trademark and logos of its member projects so it can prevent corporations from profiting off of repackaging products without respecting trademark.This establishes an interesting new direction for the sorts of software projects that are likely to become member projects. Historically, such projects developed by the member organizations (i.e., DOE labs) have been wholly controlled by the labs that funded the work, and those software projects lived and died at the whims of the government funding. The HPSF offers a new vehicle for software projects to live on beyond the end of the grants that created them, but at the same time, it requires that the DOE surrender control of the work that it sponsored.I left the session still wondering a few pretty major things, likely borne out of my own ignorance of how similar organizations (like CNCF or the Apache Foundation) work:How does a software project actually become a member project? 
The HPSF folks said that the Technical Advisory Committee onboards new projects, but what is the bar if I have an open-source project used by the community that I no longer want to maintain myself? I assume it's not a pay-to-play arrangement since that defeats the purpose of sustaining software after its seed funding runs out.What do stakeholders actually get out of joining HPSF? I see obvious value for organizations (like the DOE labs) who develop open-source software but may not want to be exclusively responsible for sustaining it forever. But would an HPC facility get any obvious benefit from joining and paying dues if it is simply a consumer of member projects' software? What does a cloud vendor like AWS get by being a premier member? Is HPSF just a way to get someone else to cover the overheads of maintaining open-source software that comes out of, say, R&D organizations rather than product organizations?Hopefully the answers to these questions become clearer as the foundation gets off the ground and we get to see what member organizations contribute under the HPSF banner.Ultimately though, I see this as a really positive direction for the HPC software community that might help resolve some uncertainty around key pieces of HPC software that have uncertain ownership. For example, I wound up as a maintainer of the IOR and mdtest benchmark because I was the last one to touch it when its previous maintainer lost interest/funding. I don't even work in I/O performance anymore, but the community still uses this benchmark in virtually every procurement of parallel file systems either directly or through IO500. It would be wonderful if such an important tool didn't rest on my shoulders and had a more concrete governance structure given how important it is.Quantum computingBesides AI and cloud, quantum computing was cited in Kathy Yelick's opening keynote as the third disruptor to HPC for scientific computing. 
At the time, I thought citing quantum was just an obligation of any opening keynote speaker, but quantum computing was particularly high-profile at ISC this year. I was surprised to see over a dozen quantum computing companies on the vendor exhibition floor, many of whom were Europe-based startups.In addition, this year's Hans Meuer award (for best research paper) was given to a paper on quantum computing by Camps et al. This is particularly notable since this is the first time that the Meuer award has ever been given to a paper on a topic that isn't some hardcore traditional HPC like MPI or OpenMP advancements; by comparison, this award has never been given to any papers on AI topics. Granted, the winning paper was specifically about how to use conventional HPC to solve quantum problems, but this recognition of research in quantum computing makes a powerful statement: quantum computing research is high-performance computing research.Reinvent HPC to include urgent computing?I was invited to give a lightning talk at the Workshop on Interactive and Urgent High-Performance Computing on Thursday, and urgent/interactive HPC is not something I'd really paid attention to in the past. So as not to sound like an ignorant fool going into that workshop, I opted to sit in on a focus session titled \"Urgent Computing\" on Tuesday. I had two goals:Make sure I understood the HPC problems that fall under urgent and interactive computing so I could hold an intelligent conversation on this topic at the Thursday workshop, andSee if there are any opportunities for cloud HPC to provide unique value to the challenges faced by folks working in urgent HPCI'll describe what I came away with through these lenses.The Urgent Computing focus sessionWhat I learned from the focus session is that urgent computing is not a very well-defined set of application areas and challenges. 
Rather, it's another manifestation of reinventing HPC to include any kind of computation for scientific purposes.Much to my surprise, this \"Urgent Computing\" focus session was actually a session on IoT and edge computing for science. Several speakers spoke about getting data from edge sensors on drones or telephone poles into some centralized location for lightweight data analysis, and the \"urgent\" part of the problem came from the hypothetical use cases of analyzing this sensor data to respond to natural disasters. There wasn't much mention of anything requiring HPC-like computing resources; at best, a few talks made unclear references to using AI models for data analysis, but it felt like grasping:The above conclusion slide was presented by one of the speakers, and to be honest, I don't understand what any of it means. Granted, I know very little about urgent computing, IoT, or edge computing so there may be some domain jargon here that's throwing me off. But based on this, as someone working in the area of HPC and AI in the cloud, I don't think I have a role to play here. I'm sure cloud computing can help, but the challenges would be in general-purpose cloud rather than HPC.The Interactive and Urgent HPC workshopFortunately for me, the Thursday workshop on Interactive and Urgent HPC was much less about edge/IoT and more about developing software infrastructure and workflows that allow scientific data analysis of large datasets to happen before the results become obsolete. It was a fascinating workshop for learning about specific science drivers that require fast access to HPC resources, and how different HPC providers are enabling that through non-traditional services and policies. Below are a few highlights.Sam Welborn (NERSC) presented his team's efforts to convert a streaming data workflow from its current file-based approach into one that streamed directly into compute node memory. 
The specific use case was the initial data processing for image information coming off of a scanning transmission electron microscope at 480 Gbps, totaling 750 GB per shot. As he described it, the current technique involves streaming those data to files at the microscope, then copying those files to the parallel file system of a remote supercomputer, then reading, processing, and writing that data within the HPC environment to prepare it for downstream analysis tasks. And for what it's worth, this is how I've always seen \"streaming\" HPC workflows actually work; they're actually using file transfers, and the performance of both the file system at the source and destination are in the critical path.The problem with this approach is that parallel file systems on HPC systems tend to be super flaky, and there's no real reason to bounce data through a storage system if you're just going to pick it up and process it. So, Dr. Welborn showed a true streaming workflow that skipped this file step and used ZeroMQ push sockets at the microscope and pull sockets on the HPC compute nodes to do a direct memory-to-memory transfer:Seeing software like ZeroMQ used to enable communication in an HPC environment instead of forcing this workflow to fit into the MPI paradigm is an encouraging sign in my eyes. ZeroMQ, despite not using purpose-built HPC technology like RDMA, is the right tool for this sort of job since it supports much better resilience characteristics than messaging libraries designed for tightly coupled HPC jobs. 
Workflows like this that combine beefy GPU nodes with software developed in the commercial tech space suggest that the world of HPC is willing to abandon not-invented-here ideology.It wasn't clear to me that there's a great opportunity for cloud HPC to be uniquely useful in use cases like this; while you certainly can provision beefy CPU and GPU nodes with InfiniBand in Azure, cloud services can't obviously simplify this ZeroMQ-based workflow beyond just supplying general-purpose VMs on which the orchestration services can run. Had this team stuck with a file-based streaming mechanism, the performance SLAs on cloud storage (like object or ephemeral Lustre) would provide a more reliable experience to ensure the data transfer happened in near-real-time. But the better solution to unpredictable file system performance is to do exactly what was done here: skip the file system entirely.Just to keep the speaker honest, I asked why this computation couldn't simply be done at the same place as the microscope generating the data. After all, if the microscope always generates 750 GB per shot, you should be able to buy a couple GPU servers that are ideally sized to process that exact workload in the time between images. There were actually two answers: one from Sam and one from an audience member:Sam said that you can process this workflow locally, but that the goal of this work was to prepare for a future microscope (or another instrument) that could not. He also insightfully pointed out that there's tremendous value in getting the data into the HPC environment because of all the services that can be used to work on that data later. 
I envisioned doing things like using a Jupyter notebook to further process the data, serve it up through a web UI, and similar tasks that cannot be done if the data is stuck inside a microscope room.An audience member also pointed out that sticking GPU nodes in the same room as electron microscopes can result in enough noise and vibration to disrupt the actual scope. This was a great point! In the days before I started working in HPC, I was training to become an electron microscopist, and I worked in a lab where we had water-cooled walls to avoid the problems that would be caused by air conditioning breezes. There's no way a loud server would've worked in there.Toshio Endo (Tokyo Tech) gave an interesting talk on how they enable urgent/interactive compute jobs on their batch-scheduled TSUBAME4.0 supercomputer by doing, frankly, unnatural things. Rather than holding aside some nodes for interactive use as is common practice, his work found that a lot of user jobs do not completely use all resources on each compute node they reserve:I had to do a double-take when I saw this: even though 65%-80% of the nodes on the supercomputer were allocated to user jobs, less than 7% of the GPUs were actually being utilized.Dr. Endo's hypothesis was that if nodes were suitably subdivided and jobs were allowed to oversubscribe CPUs, GPUs, and memory on a compute node without impacting performance too much, they could deliver real-time access to HPC resources without having to create a separate pool of nodes only for interactive uses. He defined success as each shared job running at no less than 1/k of its unshared speed when k jobs shared the same node; for example, if four jobs were all running on the same node, each one taking four times as long to complete would be acceptable, but any longer would not. He then went on to show that the best way to accomplish this is using Slurm's gang scheduling, where each job takes turns having exclusive access to all the CPUs and GPUs on a node. 
The alternative (just letting the OS context switch) was no good.While a fascinating study in how to provide zero wait time to jobs in exchange for reduced performance, this whole mechanism of using gang scheduling to exploit low resource utilization seems like jamming a square peg into a round hole. If a workload doesn't (or can't) use all the GPUs on a node, then that's not the right node for the job; I feel like a more appealing solution would simply be to offer a heterogeneous mix of nodes based on the demands of the workload mix. This is hard to do if you're buying monolithic supercomputers since you're stuck with whatever node mix you've got for five years, but there is another way to buy supercomputers!I won't pretend like dynamically provisioning different flavors of CPU- and GPU-based nodes interconnected with InfiniBand in the cloud doesn't come with a cost; the convenience of being able to slosh a cluster makeup between CPU-heavy and GPU-heavy nodes will be more expensive than committing to use the same makeup of node flavors for multiple years. But if you're paying for GPUs that are only being used 7% of the time, surely it's cheaper to pay a higher cost for GPUs when you need them if it also allows you to not pay for them 93% of the time when they're idle.Bjoern Enders (NERSC) gave the first lightning talk where he presented the exploration they're making into enabling real-time and urgent computation. They're currently going in three parallel directions to provide this capability:Reservations, a process by which a user can request a specific number of nodes for a specific period of time, and Slurm ensures that many nodes are available for the exclusive use of that user by the time the reservation starts. He said that implementing this at NERSC is costly and rigid because it requires a human administrator to perform manual steps to register the reservation with Slurm. 
Realtime queues, where a few nodes are held from the regular batch queue and only special real-time users can submit jobs to them. Dr. Enders said that NERSC is extremely selective about who can access this queue for obvious reasons: if too many people use it, it will back up just like the regular batch queues do.Jupyter Hub, which utilizes job preemption and backfill under the hood. If a user requests a Jupyter job, Slurm will pre-empt a job that was submitted to a preemptible queue to satisfy the Jupyter request. However, if there are no preemptible jobs running, the Jupyter job will fail to launch after waiting for ten minutes.To provide compute resources to back up these scheduling capabilities, they also deployed a new set of compute nodes that can be dynamically attached to different supercomputers they have to support urgent workloads even during downtimes.  Called \"Perlmutter on Demand\" (POD), it sounded like a separate set of Cray EX racks that can be assigned to either the Perlmutter supercomputer, or if Perlmutter is down for maintenance, either their smaller Alvarez or Muller supercomputers which share the same Cray EX architecture. What wasn't clear to me is how the Slingshot fabrics of these nodes interact; perhaps POD has its own fabric, and only the control plane owning those racks are what changes.He showed a slide of explorations they're doing with this POD infrastructure, but as with Dr. Endo's talk, this seemed a bit like a square peg in a round hole:All of this sounds aligned with the strengths of what HPC in a cloud environment can deliver, and some of the big challenges (like figuring out the ideal node count to reserve for interactive jobs) are problems specific to Slurm and its mechanism for scheduling. 
There's a lot more flexibility to rapidly provision HPC resources in cloud environments because, unlike the case where Slurm is scheduling jobs on a single cluster, cloud resource managers can schedule across any number of clusters independently. For example, if an urgent workload needing only four GPU nodes suddenly appears, it doesn't necessarily have to be scheduled on the same InfiniBand fabric that a large hero job is running on. Since the urgent job and the hero job don't need to talk to each other, cloud resource managers can go find a GPU cluster with a little more flex in them to provision those resources quickly.Automating the process of reservations is also a bit of a game of catch-up, though my guess is that this is more a matter of someone having a weekend to sit down and write the REST service that manages incoming reservation requests. Although there's not a direct analog for reservations like this in Azure, AWS has a feature called AWS Capacity Blocks that does exactly this: if you know you'll want a certain number of GPU nodes sometime in the future, Capacity Blocks let you reserve them ahead of time through an API.Finally, I represented Microsoft and gave a lightning talk that riffed on a lot of what I've been writing about in this blog post: HPC seems to be reinventing a lot of things that the cloud has already figured out how to do. The illustrious Nick Brown was kind enough to snap a photo of one of my slides and post it on Twitter:My thesis was that the way urgent HPC workflows are triggered, scheduled, run, and reported on follows the same pattern that inferencing-as-a-service services (like Copilot and ChatGPT) are implemented under the hood, right down to executing multi-node jobs on InfiniBand clusters. The difference is that these cloud workflows are built on the foundation of really nice cloud services that provide security, scalability, monitoring, and hands-free management that were originally developed for commercial (not HPC!) 
customers. My argument was that, even if you don't want to pay cloud providers to run urgent HPC workflows as a managed service, you can use these services (and the software infrastructure on which they're built) as a blueprint for how to build these capabilities in your own HPC environments.Concluding thoughtsThe ISC'24 conference was fantastic, and I am glad it has not lost the unique elements that made me want to attend in the years prior to the pandemic. It's still that smaller, intimate, and focused HPC conference that brings the community together. Although a lot of my synopsis above may sound critical of the content presented over the four days I attended, the fact that I've had so much to write down in this blog post is a testament to the value I really get out of attending: it makes me sit down and think critically about the way the HPC community is evolving, what the leading minds in the field are thinking, and where I might be able to contribute the most in the coming year.I never much paid attention to the annual taglines of conferences like ISC, but this year's \"Reinvent HPC\" really resonated. The HPC community is at a crossroads. Exascale computing for science is now in the rear-view mirror, and large-scale AI is all the rage across the computing industry at large. But for the first time ever, this new direction in at-scale computing is happening without the inclusion of the people and organizations who've historically driven innovation in HPC. Whereas institutions like Oak Ridge, RIKEN, Cray, and Fujitsu defined the future of computing for decades, hundred-person startups like OpenAI and Anthropic are now paving the way in partnership with companies like Microsoft and Amazon.HPC needs to be reinvented, if for no other reason than to decide whether the HPC community wants to be inclusive of new frontiers in computing that they do not lead. 
Does the HPC community want AI to be considered a part of HPC?Judging from many speakers and panelists, the answer may be \"no.\" To many, it sounded like AI is just another industry that's sucking all the air (and GPUs) out of the room; it's a distraction that is pulling funding and public interest away from solving real problems. It's not something worth understanding, it's not something that uses the familiar tools and libraries, and it's not the product of decades of steady, government-funded improvements. AI is \"them\" and HPC is \"us.\"Personally, I'd like the answer to be \"yes\" though. Now that I'm on the other side of the table, supporting AI for a cloud provider, I can say that the technical challenges I face at Microsoft are the same technical challenges I faced in the DOE. The desire to deeply understand systems, optimize applications, and put world-class computing infrastructure in the hands of people who do amazing things is the same. And as the days go by, many of the faces I see are the same; instead of wearing DOE or Cray badges, my lifelong colleagues are now wearing NVIDIA or Microsoft badges.All this applies equally to whether cloud is HPC or not. The HPC community needs to reinvent itself to be inclusive of everyone working towards solving the same problems of computing at scale. Stop talking about people who work on commercial AI in cloud-based supercomputers as if they aren't in the room. They are in the room. Often near the front row, snapping photos, and angrily posting commentary on Twitter about how you're getting it all wrong.HPC has historically been used to solve scientific problems, whether to expand our understanding of the universe, to find the next best place to drill an oil well, or to model the safety of aging nuclear weapons. The fact that HPC is now being used to solve squishier problems related to natural language or image generation does not change the essence of HPC. 
And whether that HPC is delivered through physical nodes and networks or virtualized nodes and networks is irrelevant, as long as those resources are still delivering high performance. AI is just as much HPC as scientific computing is, and cloud is just as much HPC as OLCF, R-CCS, or CSCS is.So perhaps HPC doesn't need to be reinvented as much as the mindset of its community does.That all said, I am genuinely impressed by how quickly ISC'24 has been reinventing itself in recent years. It wasn't too long ago that all its keynote speakers were greybeards from a predictable pool of public HPC centers all saying the same things year after year. It's wonderful to see a greater diversity of perspectives on the main stage and torches passing on to the next generation of leading figures in the field. And it was not lost on me that, for the first time in the history of this conference, Thomas Sterling did not deliver the closing keynote. As much fun as I had poking fun at his meandering and often-off-the-mark conjectures every year, it was delightful to be exposed to something new this year.I'm hopeful that ISC will continue to get better year over year, and ISC'25 will feel more inclusive of me despite the fact that I am now one of those hyperscale cloud AI people. So long as I still feel like it's my community, though, I will keep showing up in Germany every summer.",
            "content_html": "<p>I had the great pleasure of attending the ISC High Performance conference this month, marking the fifth time I've attended what has become one of my top must-attend industry conferences of the year. This year was particularly meaningful to me because it is the first time that:</p><p></p><ol style=\"text-align: left;\"><li>I attended ISC as a Microsoft employee. This is also the first time I've attended any HPC conference since I changed my focus from storage into AI infrastructure.</li><li>I attended ISC in-person since before the pandemic. It's also the first time I've visited Hamburg which turned out to be an absolute delight.</li></ol><p></p><p>Although registrations have been lower since the pandemic, this year's final registration count was over 3,400 attendees, and there was no shortage of old and new colleagues to bump into walking between the sessions at the beautiful Congress Center Hamburg.</p><div class=\"separator\" style=\"clear: both; text-align: center;\"></div><p>This year’s theme was “Reinvent HPC,” and that idea, that HPC needs to reinvent itself, was pervasive throughout the program. The whole industry had been pulling towards exascale for the better part of a decade, and now that there are two exaflop systems on Top500 and the dust is settling, it feels like everyone is struggling to figure out what’s next. Is it quantum? AI?</p><p>It was difficult for me to draw a line through all the topics worth reviewing at this year's ISC, as it was a very dense four days packed with a variety of topics, discussions, vendors, and events. 
I only experienced a fraction of everything there was to be seen since so many interesting sessions overlapped, but I thought it might be worthwhile to share my perspective of the conference and encourage others to do the same.<span></span></p><p></p><div id=\"toc\"><h2>Table of Contents</h2><ul><li><a href=\"http://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#section1\">Reinventing HPC (and blast those hyperscalers!)</a><ul><li><a href=\"http://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#section11\">Kathy Yelick's opening keynote</a></li><li><a href=\"http://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#section12\">Closing keynotes on the future</a></li></ul></li><li><a href=\"http://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#section2\">Top500 and Aurora</a><ul><li><a href=\"http://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#section21\">#1 - Frontier</a></li><li><a href=\"http://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#section22\">#2 - Aurora</a></li><li><a href=\"http://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#section23\">#3 - Eagle</a></li><li><a href=\"http://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#section24\">Other notable tidbits</a></li></ul></li><li><a href=\"http://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#section3\">Everyone is an AI expert!</a><ul><li><a href=\"http://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#section31\">The Exascale AI Synergies LLM Workflows BOF</a></li><li><a href=\"http://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#section32\">AI Systems for Science and Zettascale</a></li><li><a href=\"http://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#section33\">Real applications of generative AI for science</a></li></ul></li><li><a href=\"http://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#section4\">High Performance Software Foundation</a></li><li><a 
href=\"http://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#section5\">Quantum computing</a></li><li><a href=\"http://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#section6\">Reinvent HPC to include urgent computing?</a><ul><li><a href=\"http://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#section61\">The Urgent Computing focus session</a></li><li><a href=\"http://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#section62\">The Interactive and Urgent HPC workshop</a></li></ul></li><li><a href=\"http://blog.glennklockwood.com/feeds/posts/default/-/hpc?alt=rss#section7\">Concluding thoughts</a></li></ul></div><h2 id=\"section1\">Reinventing HPC (and blast those hyperscalers!)</h2><p>The need to reinvent HPC was the prevailing theme of the conference from the very first session; with the listing of Aurora as the second system on Top500 to break the 1 exaflops barrier, the community is in search of a new milestone to drive research (and funding!). At the same time, commercial AI has rapidly risen up largely in an independent, parallel effort with a speed and scale that begs the question: how important was the decade-long drive to break the exaflops barrier if the AI industry could catch up so quickly without the help of the institutions that have historically posted the top HPL scores? 
If the commercial AI industry overtakes scientific computing as the world leader in deploying at scale, how can “HPC” be reinvented so it can continue to claim leadership in another dimension?</p><h3 id=\"section11\">Kathy Yelick's opening keynote</h3><p>ISC’s opening keynote was given by Kathy Yelick, where she provided commentary on two recent government-commissioned reports on the future of HPC:</p><p></p><ol style=\"text-align: left;\"><li><a href=\"https://nap.nationalacademies.org/catalog/26916/charting-a-path-in-a-shifting-technical-and-geopolitical-landscape\">Charting a Path in a Shifting Technical and Geopolitical Landscape: Post-Exascale Computing for the National Nuclear Security Administration</a>, commissioned by the National Academies</li><li><a href=\"https://www.osti.gov/biblio/1989107\">Can the United States Maintain Its Leadership in High-Performance Computing?</a>, commissioned by the US Department of Energy’s Advanced Scientific Computing Research program</li></ol><p style=\"text-align: left;\">Living up to her reputation, Dr. Yelick’s talk was fast and insightful, describing the insatiable demand for computing driven by scientific research, the struggle to expose continuing amounts of parallelism to make use of newer processors, and some promising directions to address that disconnect. However, her talk started in a direction that I didn’t like when she went into describing the disruptors that necessitate reinventing HPC:</p><div class=\"separator\" style=\"clear: both; text-align: center;\"></div><p style=\"text-align: left;\">The above slide implied that AI, quantum, or cloud may pose an existential threat to the HPC community gathered at ISC this year; this immediately raised my hackles, as it cast the relationship between “HPC” and “AI”/“cloud” as having some sort of adversarial tension. As the talk went on, I realized that “HPC” didn’t really mean “high-performance computing” to her. 
Rather, it was used to refer to something much more narrowly scoped—high-performance computing <i>to solve scientific problems</i>. Slide after slide, the presentation kept doubling down on this idea that “HPC” as the audience knows it is being threatened. For example, Yelick talked through this slide:</p><div class=\"separator\" style=\"clear: both; text-align: center;\"></div><p style=\"text-align: left;\">The picture she painted is that “HPC” (denoted by companies with blue bars) no longer has influence over technology providers because the “hyperscalers” (green bars) have such an outsized amount of investment. She then used this to call on the audience to think about ways “we” could influence “them” to produce technologies that are useful for both scientific computing and low-precision AI workloads.</p><p style=\"text-align: left;\">Her talk culminated in this slide:</p><div class=\"separator\" style=\"clear: both; text-align: center;\"></div><p style=\"text-align: left;\">Which was accompanied by this conclusion:</p><p style=\"text-align: left;\"></p><blockquote>\"So what’s a post-exascale strategic for the scientific community? It's the beat 'em or join 'em strategy. The beat 'em strategy says we’re going to design our own processors. [...] The join 'em strategy says let's leverage the AI hardware that's out there. [...] The sort of sneaky way of doing this is getting embedded in the AI community and trying to convince them that in order to make AI better for commercial AI applications, you really want to have certain features. 
Like don't throw away your 64-bit arithmetic and things like that.\"</blockquote><p></p><p>I found myself getting increasingly unsettled through the keynote, because this \"us versus them\" mentality put me, a long-standing member of this HPC community, in the camp of \"them.\" It was as if I was suddenly an outsider in a conference that I've been attending for years just because I no longer work for an organization that has been doing HPC since the early days of computing. Even though the clusters I support use the same NVIDIA and AMD GPUs, the same InfiniBand fabrics, and the same Lustre file systems that \"HPC\" uses, I am no longer in \"HPC\" because I am \"hyperscale\" or \"cloud\" or \"AI.\"</p><p>The underlying message is one I get; GPUs are trending in a direction that favors massive gains in lower-precision computation over FP64 performance. And the cost of HBM is driving the overall value (in FP64 FLOPS per dollar) of accelerators backwards for the first time in the history of scientific computing. But the thesis that the scientific computing community needs to be sneaky to influence the hyperscale or AI players seemed way off the mark to me. What seemed absent was the recognition that many of the \"hyperscalers\" are her former coworkers and remain her colleagues, and \"they\" sit in the same audiences at the same conferences and share the same stages as the \"HPC\" community. 
All that is true because \"HPC\" is not somehow different than \"cloud\" or \"AI\" or \"hyperscale.\" If there really is a desire to influence the hyperscale and AI industry, the first step should be to internalize that there is no \"us\" and \"them.\"</p><h3 id=\"section12\">Closing keynotes on the future</h3><p>Just as the conference was opened with a talk about this \"us versus them\" mentality, it was closed with a talk about \"us versus them\" in a keynote session titled, \"Reinventing HPC with Specialized Architectures and New Applications Workflows\" which had two speakers followed by Q&amp;A.</p><h4>Chiplets for modular HPC</h4><p>John Shalf gave one half of the closing keynote, where he gave his usual rally for investments in chiplets and specialized processors for HPC:</p><div class=\"separator\" style=\"clear: both; text-align: center;\"></div><p style=\"text-align: left;\">He gives a variant of this talk at every ISC, but this year he lasered in on this notion that the \"HPC\" community needs to do what the \"hyperscalers\" do and use chiplets to develop custom ASICs. It was an energetic and impassioned talk, but this notion that hyperscalers are already executing on his idea for the future sounded a little funny to me seeing as how I now work for one of these hyperscalers and his message didn't resonate.</p><div class=\"separator\" style=\"clear: both; text-align: center;\"></div><p style=\"text-align: left;\">If you really follow the money, as Shalf suggested, a huge amount of it is flowing into GPUs, not specialized processors. It wasn't clear to me what specialization he was thinking of when he referred to custom silicon being developed by the likes of Meta, Google, AWS, and Microsoft; it's true that these companies are developing their own silicon, but those efforts are largely addressing cost, risk, and supply, not improving performance beyond more general-purpose silicon like GPUs. 
And it turns out that a significant fraction of the (non-US) HPC community is already developing custom silicon for the same reasons as the hyperscalers; Japan, China, and Europe are all developing their own indigenous processors or accelerators for scientific computing at leadership scales. In that sense, Shalf was preaching to the choir given that, on the international stage, his government is the odd one out of the custom silicon game.</p><p style=\"text-align: left;\">He also suggested a dichotomy where the HPC community would either have to just (1) make every scientific problem an AI problem or (2) join this journey towards making domain-specific accelerators, ignoring the significant, unexplored runway offered by using mixed precision arithmetic in scientific applications. He called for partnering with hyperscalers, but his examples of implementing a RISC-V-based stencil accelerator and a SambaNova-based DFT processor didn't draw a clear line to the core missions of the large hyperscalers he extolled. He briefly said that partnering would benefit hyperscalers by addressing some capital cost challenges, but seeing as how the annual capital expenditures of the hyperscalers outstrips those of the US national HPC effort by orders of magnitude, I couldn't understand what the hyperscalers would stand to gain by partnering in this way.</p><h4 style=\"text-align: left;\">Integrating HPC, AI, and workflows</h4><p style=\"text-align: left;\">Rosa Badia gave the second half of the closing keynote where she proposed ideas around complex scientific workflows and the novel requirements to support them. 
This talk felt a lot more familiar, as the focus was squarely on solving scientific computing challenges by connecting traditional HPC resources together in nontraditional ways using software whose focus goes beyond cranking out floating point arithmetic.</p><p style=\"text-align: left;\">As she spoke, I couldn't help but see parallels between the challenges she presented and the sort of technologies we live and breathe every day in cloud services.  For example, she showed this slide:</p><div class=\"separator\" style=\"clear: both; text-align: center;\"></div><p style=\"text-align: left;\">Dr. Badia obviously wanted to make a cloud tie-in by calling this \"HPC Workflows as a Service,\" but what I'm not sure she realized is that this model almost exactly describes platform-as-a-service frameworks that already exist in commercial clouds. For example,</p><p style=\"text-align: left;\"></p><ul style=\"text-align: left;\"><li>What she calls a \"Data Catalog\" is a public or private object storage account (a blob container, an S3 bucket) or a PaaS abstraction built atop them</li><li>What she calls a \"Software Catalog\" is a container registry (Azure Container Registry, Amazon Elastic Container Registry) or an abstraction built atop them</li><li>A \"Workflow Description\" is something like an <a href=\"https://learn.microsoft.com/en-us/azure/machine-learning/how-to-create-component-pipeline-python\">AzureML pipeline</a> or <a href=\"https://sagemaker.readthedocs.io/en/stable/amazon_sagemaker_model_building_pipeline.html\">SageMaker pipeline</a></li><li>A \"Workflow Registry\" is just a GitHub repository containing pipelines</li><li>The \"Portal\" is the web UI provided by AzureML or SageMaker</li></ul><p></p><p style=\"text-align: left;\">I don't think there's anything truly new here; the challenges she described lie in wedging these workflows into HPC infrastructure that lacks platform features like robust identity and access management (i.e., something better 
than LDAP that supports more modern authentication and authorization flows and finer-grained access controls) and data management (i.e., something better than a parallel file system that depends on POSIX users, groups, and permissions and implicit trust of clients).</p><p style=\"text-align: left;\">She went on to describe a workflow data management system that reinvented a bunch of infrastructure that is already baked into commercial cloud object stores like Azure Blob and AWS S3:</p><div class=\"separator\" style=\"clear: both; text-align: center;\"></div><p style=\"text-align: left;\">As she was describing the requirements for such a workflow data management layer, it struck me that what the scientific data community calls \"<a href=\"https://en.wikipedia.org/wiki/FAIR_data\">FAIR principles</a>\" are the same basic requirements for operating in commercial environments where data may be subject to strict privacy and compliance regulations. The notion of findable data may be aspirational for scientific datasets, but when a company is having to find datasets because it's being sued or subpoenaed, findability is a bare-minimum requirement for any data management system. Similarly, tracking the provenance of data may be a nice-to-have for scientific data, but it is a hard requirement when establishing a secure software supply chain. Cloud storage systems solved many of these challenges a long time ago, and I can't help but wonder if this idea that workflows in HPC pose a new set of challenges is another manifestation of \"us\" not realizing \"they\" might have done something useful and applicable for science.</p><p style=\"text-align: left;\">Badia's final slide had a particularly poignant statement which read, \"Systems can only be justified if we have applications that need them.\" I think she was trying to call for more investment in application development to exploit new systems, but I think the inverse is also true. 
If modern scientific applications truly require more complex orchestration of compute and data, maybe the scientific computing community should stop building computing platforms that make it really difficult to integrate different systems.</p><p style=\"text-align: left;\">Again, \"HPC\" is not the opposite of \"cloud;\" it's not an either/or decision. There are technologies and tools that were designed from the beginning to simplify the secure connection of services and resources; they just weren't invented by the HPC community.</p><h2 id=\"section2\">Top500 and Aurora</h2><p>One of the cornerstones of ISC is the semiannual release of the Top500 list, and unlike at SC, the Top500 announcements and awards do not overlap with any other sessions, so it tends to have a higher profile and draw all attendees. This go-around, there were no dramatic changes in the Top 10; the new Alps system at CSCS was the only new entry, and the order of the top five systems remained the same. Notably, though, Aurora posted a significantly higher score than at SC'23 and broke through the exaflops barrier using 87% of the system, cementing its place as the second exascale system listed. But let's start at the top.</p><h3 id=\"section21\">#1 - Frontier</h3><p>Frontier at Oak Ridge remained #1, but it squeezed twelve more petaflops out of the same node count and is now just over 1.2 EF. Nothing groundbreaking, but it's clear evidence that ORNL is continuing to tune the performance of Frontier at full system scale.</p><h3 id=\"section22\">#2 - Aurora</h3><p>Aurora, on the other hand, finally eked over the exaflops line with 1.012 EF using 87% of the system's total 63,744 GPUs. 
Rick Stevens gave a short talk about the achievement which is summed up on this slide:</p><div class=\"separator\" style=\"clear: both; text-align: center;\"></div><p>I was a little surprised by how honest Stevens was in this talk; the typical game that is played is that you stand up on stage, talk about how great of a partnership you had with your partners to realize this achievement, extol the virtues of the technologies on which your system was built, and talk about how this HPL score is just the start of a lot of great science.</p><p>Stevens didn't do that though.</p><p>He started out by telling the conference that Intel had bad product names, then explained that their low Graph500 and HPCG scores were the result of their exclusive focus on breaking the exaflops barrier with HPL, implying they didn't have time or ability to run Graph500 or HPCG at the same 87%-89% scale as their HPL and HPL-MxP runs. Based on this, it sounds like Aurora is still a ways away from being stable at scale, and we're unlikely to see any Gordon Bell-nominated papers at SC'24 this November.</p><p>After this session, folks seemed to relish in <a href=\"https://x.com/hpc_guru/status/1792018127464874176?s=61\">dunking on Aurora</a>; its <a href=\"https://x.com/hpc_guru/status/1790248333120000500?s=61\">window to be #1 is likely to have closed</a> and it has <a href=\"https://x.com/hpc_guru/status/1790273734730985865?s=61\">some power efficiency issues</a>. But I don't think anyone involved with the Aurora project needs to be told that; if what Stevens implied is true, the folks at ALCF, Intel, and HPE have been struggling for a long time now, and topping out over 10<sup>18</sup> was a hard-sought, major milestone to be celebrated. The Aurora project has been thrown more curveballs than I would have ever guessed a single HPC project could have, so all parties deserve credit for sticking it through all this way rather than just walking away. 
With any luck, Aurora will stabilize in the next six months, and we'll see full-scale runs of Top500, Graph500, HPCG, and science apps by November.</p><h3 id=\"section23\">#3 - Eagle</h3><p style=\"text-align: left;\">The third highest system on the list was Eagle, whose HPL score was not updated since the system was first listed at SC'23 last year. Through a few twists of fate, I wound up being the person who accepted the award on-stage, and I now have a Top500 award for the #3 system sitting in my home office. Here's a photo of me goofing around with it:</p><div class=\"separator\" style=\"clear: both; text-align: center;\"></div><p style=\"text-align: left;\">It's not entirely inappropriate that I was the one to accept it since my teammates are the ones carrying pagers for the on-call rotation of that system, and we were also the hands-on-keyboard when that HPL run was conducted. Still, it was a bit surreal to walk on-stage to pick up such a noteworthy award immediately following two actually important people (both of whom have \"director\" in their titles) accepting the same award. By comparison, most of my career highlights to date have been just trolling HPC people on Twitter (as the esteemed Horst Simon actually said out loud as I was leaving the stage!)</p><p style=\"text-align: left;\">It was weird.</p><p style=\"text-align: left;\">That said, I take this to mean that it is now my duty to be the friendly face from Microsoft who can speak intelligently about the #3 system on Top500. To that end, I'll answer some questions that I was asked at ISC about the system and Azure HPC clusters in general below. <i>None of this is new or secret information!</i></p><p style=\"text-align: left;\"></p><ul style=\"text-align: left;\"><li><b>Why didn't you run HPL again and post a higher score to beat Aurora?</b> Because the day after that HPL run completed, that system was put into production. 
Once systems are in production, people are paying to use them, and taking a time-out to re-run HPL costs a ton of money in either real dollars (if a customer runs it) or lost revenue (if the HPL run is blocking customer workloads). This is quite different from public-sector HPC systems, which never have to pay for themselves.</li><li><b>Can I get access to Eagle for a Gordon Bell run or to test software?</b> That's not really how it works. Whereas a traditional supercomputer might allow users to ssh in and submit jobs to a Slurm queue, cloud-based supercomputers allow users to deploy virtual machines through a REST API. Those virtual machines can allow ssh, run Slurm, and support MPI jobs like HPL, but that OS environment is managed by Azure users, not Azure itself. You can get a taste for what's required to run a basic MPI job by reading some instructions I wrote on <a href=\"https://www.glennklockwood.com/cloud/mpi-cluster.html\">provisioning an MPI cluster on Azure</a>.</li><li><b>Is it just a bunch of GPU nodes scattered around a bunch of data centers?</b> No, all the nodes on any given Azure HPC cluster (like Eagle) share an InfiniBand fabric. There are countless InfiniBand clusters in Azure, but each one is a real supercomputer by any definition of a supercomputer, and they are designed to run tightly coupled jobs across all their GPUs.</li><li><b>What parallel file system does it use?</b> Don't think about it that way. You can provision a Lustre file system and mount that to any or all cluster nodes if you want to, or you can access data directly from object storage.</li><li><b>Are there any photos of it?</b> You can see a photo of one of the Microsoft-designed nodes that comprise the system on my <a href=\"https://blog.glennklockwood.com/2023/11/sc23-recap.html\">SC'23 recap blog post</a>. Beyond that, there's not much to look at because Azure HPC clusters are not meant to be photogenic like, say, Cray supercomputers. 
There are no rack graphics (or even rack doors!). It's just tons and tons of air-cooled racks with InfiniBand optics coming out of each one. Maybe the only unique thing is that the racks are painted white instead of the typical black. Not sure why.</li></ul><div>Getting back to that false separation between \"HPC\" and \"cloud,\" Eagle is strong evidence that they aren't different. What the \"hyperscalers\" do is not that different from what traditional HPC centers do. Perhaps the biggest difference is that cloud supercomputers get all the benefits of the cloud: software-defined infrastructure like virtual machines and virtual networking, integration with identity and access management that transcends simple Linux UIDs/GIDs, and the flexibility to integrate with whatever storage systems or ancillary services you want from any compute node.</div><p></p><h3 id=\"section24\">Other notable tidbits</h3><p>It is tradition for Erich Strohmaier to talk through some highlights and trends of the latest Top500 list every time a new one is announced, and in the past, <a href=\"https://x.com/glennklockwood/status/1140637729182683136?s=61\">I've been critical</a> of how he's presented conclusions from the list with this implicit assumption that computers that never post to Top500 simply don't exist. This year felt different, because Dr. Strohmaier made the explicit statement that China has completely stopped submitting to Top500. Their exascale systems aren't listed, but neither are any new systems from the past three years at the bottom of the list. They simply don't play the game anymore, making it undeniable that Top500 is no longer an authoritative list.</p><p>Just as the whole conference's theme was reinventing HPC, I felt a sense that even the most stalwart proponents of Top500 are now recognizing the need to reinvent the Top500 list. Kathy Yelick said as much during her keynote (\"Shall we replace Top500? 
What are the metrics in post-exascale computing that are important?\"), and Erich implored the audience to help expand the <a href=\"https://hpl-mxp.org/\">HPL-MxP</a> (formerly HPL-AI; an HPL-like benchmark that can use the mixed-precision capabilities of tensor cores) list. Nobody seems to know how to quantify what makes a leadership supercomputer nowadays, but accepting that HPL scores (or appearing on the Top500 list!) won't cut it is a good first step.</p><p>That all said, Top500 is still a valuable way to track technology trends in the industry. For example, this edition of the list is where NVIDIA's new Grace-Hopper node started appearing in force. The only new entrant in the Top 10 was the <a href=\"https://www.top500.org/system/180259/\">270 PF GH200</a> component of <a href=\"https://www.cscs.ch/computers/alps\">CSCS's Alps system</a>, and HPE had these EX254n GH200 blades on display on the show floor.</p><div class=\"separator\" style=\"clear: both; text-align: center;\"></div><p>To HPE/Cray's credit, they seem to have gotten the system up and running with Slingshot without the delays that plagued early Cray EX systems like Frontier and Aurora. Hopefully this is a sign that the Cray EX platform and Slingshot-11 have graduated from being risky and not-quite-production-ready.</p><p>The other notable entrants on this year's Top500 are a trio of <a href=\"https://www.top500.org/system/180283/\">early MI300A APU-based Cray systems</a> being built around the El Capitan program at Lawrence Livermore National Laboratory. 
This is a positive sign that MI300A is up and running at modest scale, and HPE also had one of these EX255a blades on display at their booth:</p><div class=\"separator\" style=\"clear: both; text-align: center;\"></div><p>The strong showing of MI300A suggests that we may see El Capitan take the top spot in the next edition of the Top500 list coming in November.</p><h2 id=\"section3\">Everyone is an AI expert!</h2><p>Since I now work on a team responsible for AI infrastructure, I tried attending as many of the AI-focused talks and panels as I could this year. Unsurprisingly, these sessions largely carried the same undertones of \"reinventing HPC,\" and speakers opined on how AI would affect scientific computing and offered examples of what their institutions were doing to extend their leadership in the HPC space into the AI space. There was a fair amount of grasping going on (as there always is when AI is discussed at non-AI conferences), but this year I was struck by how confused so many speakers and attendees were about concepts related to applying AI.</p><p>To be clear: I am no expert in AI. However, my day job requires that I be steeped in some of the largest AI training workloads on the largest AI supercomputers on the planet, and I have to have a cursory understanding of the latest model architectures and techniques to anticipate how future system designs will have to evolve. It's from this perspective that I made the following observation: there are a lot of HPC people speaking very confidently about AI based on an outdated understanding of the state of the art. The AI industry generally moves much faster than the government-funded research community, and I couldn't help but wonder if some community leaders assumed that the AI industry today is the same as it was the last time they wrote their AI grant proposal.</p><p>Of course, there were also some really insightful perspectives on AI for science shared as well. 
Let's talk through some examples of both.</p><h3 id=\"section31\">The Exascale AI Synergies LLM Workflows BOF</h3><p>This realization that the ISC community is not keeping up with the AI community first slapped me in the face when I ducked into a BOF session titled, \"<a href=\"https://isc.app.swapcard.com/event/isc-high-performance-2024/planning/UGxhbm5pbmdfMTgyNjgxMQ==\">Tales of Exascales – AI and HPC Supercomputing Platforms Synergies for Large Language Models (LLMs) and Scientific Workflows</a>.\" I sometimes wonder if the organizers who propose titles like that are intentionally creating word salad, but in this case, it was an apt session name; the discourse around HPC and AI was all over the board throughout the hour.</p><p>The session started on a strong, positive note with Simon McIntosh-Smith describing Bristol's new <a href=\"https://www.bristol.ac.uk/news/2023/september/isambard-ai.html\">Isambard-AI system</a>, a GH200-based Cray supercomputer funded under the broad charge of \"AI research.\" While I'm usually skeptical of such nebulously defined \"AI research\" machines, Dr. McIntosh-Smith's description of the project quickly checked a bunch of boxes on how a real AI research platform should be developed. In particular,</p><p><b>Isambard-AI was developed and deployed at the pace of AI rather than HPC for scientific computing</b>. Whereas government-funded, large-scale HPC systems typically take years to procure, Simon said that the first discussions started in August 2023, and in the nine months that followed, they had built the site, the team, and the system itself to the degree that <a href=\"https://www.top500.org/system/180257/\">a piece of the final system is already on Top500</a>. 
By comparison, LLNL's El Capitan supercomputer also debuted on Top500 this month, but <a href=\"https://www.energy.gov/articles/does-nnsa-signs-600-million-contract-build-its-first-exascale-supercomputer\">its contract was signed five years ago</a>, and its procurement began <a href=\"https://web.archive.org/web/20200605114639/https://asc.llnl.gov/coral-2-benchmarks/\">at least two years before that</a>. The AI industry would not exist if the systems it trains on took seven years to procure.</p><p><b>Isambard-AI deliberately avoided exotic AI accelerators to remain future-proof</b>. Simon rightly pointed out that the AI industry moves too quickly to anticipate whether a bespoke AI accelerator would even be relevant to whatever the hottest model architecture will be in a year. GPUs were chosen because they are the most flexible way to accelerate the widest range of AI workloads, regardless of if they are dense models, sparse models, inferencing, training, and whatever level of quantization makes sense. The reality is that cutting-edge research is done on GPUs, so aligning an AI supercomputer on the same technology will ensure that the algorithms developed by industry are immediately usable for scientific research.</p><p><b>A reasonable definition of \"AI for science\" was defined from the outset</b>. Rather than blurting out \"we need to research AI!\" and asking for a sack of money to buy GPUs, Simon outlined a vision of training AI models using data generated by physical simulation on a more conventional HPC system. 
Training models on models to create surrogate models is not particularly new, but it does establish a few reasonable architectural decisions such as having a robust data management and sharing platform, close coupling to the HPC system performing simulation, and aligning software stacks and programming environments as closely as possible.</p><p>Simon's contribution to the discussion stood out to me as the most impressive, and the discourse that followed seemed to fall into a trap of familiarity. Rather than focusing on the new and exciting prospects of AI, some panelists and audience members wanted to focus on the aspects of AI they understood. For example, an uncomfortable amount of time was spent on a back-and-forth on how HPC centers can support Kubernetes and random I/O (which is what defines AI vs. HPC?) instead of Slurm and Lustre. If your biggest challenge in delivering infrastructure to support AI workloads is figuring out how to deploy both Kubernetes and Slurm, you haven’t even reached the starting line. This is a trivial issue in cloud environments, where entire AI clusters can be built up and torn down in minutes. Again, this is evidence that the scientific computing community isn’t ready to keep pace with the AI industry.</p><p>I jotted down a few of the questions and comments that I heard during this BOF that seem to reflect the level of familiarity the average ISC attendee has with AI:</p><p></p><ul style=\"text-align: left;\"><li><b>\"Would be nice if there were more models for science.\"</b> I wasn't sure what this means. 
All the leading LLMs are pretty good at \"science,\" and domain-specific models aren't readily transferable between different science domains or problems.</li><li>Scientific problems <b>\"have to validate outputs for correctness, unlike LLMs.\"</b> I think the speaker was making a sidelong reference to hallucinations, but like with any model (large language or physics-based), validating outputs for correctness is certainly necessary and readily possible.</li><li><b>\"The demands of inference of LLMs are completely different from those for training. How do you buy inference infrastructure?\"</b> I wonder where this notion came from. If your infrastructure can train a model, it can definitely inference that model. Cost-optimizing infrastructure for inferencing is a separate matter (you can cut corners for inferencing that you wouldn't want to cut for training), as is building the service infrastructure around inferencing to deliver inferencing as a service. But I don't think that's what this question was about.</li><li><b>\"Working safely with sensitive data / isolating workloads on big shared clusters.\"</b> This is a problem that arises only when you try to wedge AI workloads into infrastructure designed for traditional physics-based simulation. If you have sensitive data, don't use big shared clusters. Provision separate clusters for each security domain on a shared, zero-trust infrastructure.</li><li><b>\"How different are the files and filesystem access while training for LLMs, image generation models, reinforcement learning?\"</b> This question reflects a general misunderstanding of data and storage in HPC overall; how data is organized into files and how that data is accessed by a workload is an arbitrary decision made by the application developer. 
You can organize piles of text into one giant file or a million little files.</li></ul><p></p><p>There were a few questions that came up that touched on deeper issues on which the HPC community should reflect:</p><ul><li><b>\"What are the first steps for scientific groups wanting to get ready for using AI in the future?\"</b> This is probably the purest question raised in the entire session, and I think this is something the scientific computing community as a whole needs to figure out. What does \"using AI\" really mean for scientific groups? Is it training models? Fine-tuning models? Inferencing using pre-trained models on HPC infrastructure? Is it integrating simulation applications with separately managed inferencing services? Who manages those inferencing services? Does inferencing even require HPC resources, or can suitable models run on a few CPU cores? I think the first step to answering this question is ensuring that the scientific computing community reaches a common baseline level of understanding of what \"using AI\" means. And a lot of that probably means ignoring what some self-professed AI experts in the HPC community claim is the future.</li><li><b>\"Care to predict what that ChatGPT moment will be for AI for Science? Had it already happened?\"</b> This question was addressed directly by panelist Séverine Habert who rightly pointed out that the ChatGPT moment occurred when a complex and esoteric topic was suddenly put in the hands of hundreds of millions of laypeople across the world. It was the moment that the common person walking on the street could suddenly interact with the most cutting-edge technology that had been previously understandable only to the headiest of researchers in industry and academia. 
That will likely never happen in AI for science because science, by definition, requires a higher baseline of education and understanding than the average layperson has.</li><li><b>\"How to effectively train the existing workforce when we are already struggling to retain talent in research/academia?\"</b> This question strikes at the same theme that Kathy Yelick's opening keynote confronted: what is the role of the scientific computing community now that it turns out that you don't need decades of institutional experience to deploy and use HPC resources at leadership scale? As offensive as it may sound, perhaps the public-sector HPC community should accept that their role is not training future researchers and academics, but training future practitioners of AI in industry. This is how the wider tech industry generally works; neither startups nor tech giants make hires assuming those people will still be around in ten years. Why does the public-sector HPC industry think otherwise?</li></ul><p>Finally, I was also struck by how fiercely the discourse clung to the idea that large language models are the answer to all AI problems in science. I get that this panel was focused on exascale, and LLM training is one of the rare cases where AI requires exascale computing capabilities. 
But there was no acknowledgment that trillion-parameter models are not actually a good idea for most scientific applications.</p><h3 id=\"section32\">AI Systems for Science and Zettascale</h3><p style=\"text-align: left;\">This singular focus on creating massive LLMs for science was front-and-center in a talk given by Rick Stevens titled \"<a href=\"https://isc.app.swapcard.com/event/isc-high-performance-2024/planning/UGxhbm5pbmdfMTg4MTE0Mg==\">The Decade Ahead: Building Frontier AI Systems for Science and the Path to Zettascale</a>.\" The overall thesis that I heard was something like...</p><div><ol style=\"text-align: left;\"><li>Science needs its own trillion-parameter foundation models</li><li>Training trillion-parameter foundation models requires a lot of GPUs</li><li>We need $25 billion from the U.S. government</li></ol><p style=\"text-align: left;\">However, Stevens never answered a very basic question: what does a foundation model for science do that any other foundation model cannot do?</p><p style=\"text-align: left;\">He showed slides like this which really don't sound like foundation models for science as much as generic AI assistants:</p><div class=\"separator\" style=\"clear: both; text-align: center;\"></div><p style=\"text-align: left;\">Is the scientific computing HPC community really the most qualified bunch to reinvent what existing foundation models like GPT-4 or Claude 3 have already done? Even if you argue that these proprietary models aren't as good at \"science\" as they could be, who would have a better chance of addressing this with a billion dollars of federal funding: the companies who developed GPT or Claude, or a collection of government scientists starting from scratch?</p><p style=\"text-align: left;\">I think the answer to this question was in other parts of Stevens' talk. 
For example, he started with this slide:</p></div><div class=\"separator\" style=\"clear: both; text-align: center;\"></div><p>While robust requirements are good when there's no urgency, this slide is also a tacit admission that the government takes years to generate a perspective on AI. Do you think the creators of Llama-3 or Mistral Large gathered wide community input from over 1,300 researchers before deciding to build a supercomputer and train a model? Even if science needs its own foundation models, this slide is strong evidence that, by the time the scientific HPC community agrees on a path forward, that path will be years out of date relative to what the commercial AI industry is doing.</p><p>A great example of this already happening is the basic premise that creating a foundation model with a trillion parameters is the best way to apply AI to solve science problems. This certainly was the leading thought two years ago, when transformer scaling laws were published that suggested that the best way to get better-performing LLMs was to simply add more parameters to your transformer and train on more data. But there's a reason all the leading models have stopped advertising how many parameters they use.</p><p>Dealing with massive transformers is really expensive. They're not only really expensive to train, but they're really expensive to use for inferencing too. This has led to a bunch of innovation to develop model architectures and approaches to training that result in dramatically higher quality outputs from a fixed parameter count. Dense transformer architectures with a trillion parameters have become the blunt instrument in developing foundation models since 2022, so it took me by surprise to hear Stevens put so much stock into this notion that a trillion-parameter model is essential for science.</p><p>To repeat myself, I am no expert in AI. 
I've never been <a href=\"https://www.energy.senate.gov/services/files/CF8309D8-C0A1-40C7-944F-CF71EF523FF8\">called in front of Congress to talk about AI</a> or been <a href=\"https://isc.app.swapcard.com/event/isc-high-performance-2024/planning/UGxhbm5pbmdfMTg4MTE0Mg==\">invited to give talks on the topic at ISC</a>. There might be something basic that I am missing here. But when I look at the science drivers for AI:</p><div class=\"separator\" style=\"clear: both; text-align: center;\"></div><p>I <i>know</i> that you do not need to train your own trillion-parameter model to do most of this stuff. Even the use cases that do require generative AI, like code generation and math theory, don't actually require trillions of parameters. Small language models, such as those described in <a href=\"https://arxiv.org/abs/2306.11644\">Textbooks Are All You Need</a> (published in 2023, after the reports Stevens cited in his talk), can produce amazing results at very small parameter counts when you train them using high-quality data instead of garbage from Reddit. And when you create or fine-tune a small language model for a specific science domain, not only do you save yourself from having to buy a billion-dollar supercomputer for training, but you get a model that is much more accessible to scientists around the world because they won't need a million dollars' worth of GPUs to inference with it.</p><p>So, if there's one question that was never answered across any of the AI-themed sessions at ISC this year, it is this: Why does science need to train its own large language models? My intuition is that either fine-tuning existing large language models or training small language models for domain-specific applications would be a better investment in actually advancing science. 
However, if we cynically assume the real goal of LLMs-for-science is to justify buying massive GPU systems, suddenly a lot of the talks given at ISC on this topic make a lot more sense.</p><h3 id=\"section33\">Real applications of generative AI for science</h3><p>As frustrated as I got sitting through sessions on AI where it sometimes felt like the blind leading the blind, there was one really good session on actual applications of generative AI for science.</p><p><b>Mohamed Wahib </b>of RIKEN gave an insightful presentation on the unique challenges of using generative AI in science. His summary slide touched on a lot of the key challenges:</p><div class=\"separator\" style=\"clear: both; text-align: center;\"></div><p>And his actual talk focused largely on the model and data aspects of generative AI. What struck me is that the challenges he described reflected the experience of someone who has actually tried to do what many other AI experts at the conference were claiming would be the future. For example,</p><p></p><ul style=\"text-align: left;\"><li>He recognized the importance of <b>training scientific models with high-quality datasets</b>, not just garbage scraped off of social media. This means not only scraping or generating high quality data for training, but curating and attributing that data and applying reinforcement learning with human feedback as the model is being trained. This is uniquely challenging when creating models for scientific applications, as managing the quality of scientific data requires deep domain expertise. This contrasts with a generic chat bot whose inputs and outputs can often be assessed by anyone with a basic education.</li><li>He also talked about the tendency of <b>scientific data to be highly multimodal and multidimensional</b>. 
Whereas multimodal chatbots may combine text and vision, scientific data often contains observations of the same phenomenon from many different sensors (for example, pressure, temperature, density, strain fields, ...), and the output of a generative model for science may require multiple modalities as well. These capabilities are not well developed in LLMs designed for human language.</li><li>Dr. Wahib also pointed out that scientific datasets tend to be huge compared to text and images, and this may require developing ways for models to have <b>context windows that can fit multi-petabyte datasets' tokens</b> to identify long-range correlations. Relatedly, he also pointed out that <b>tokenization of scientific data</b> poses a new set of challenges unique to this community, since industry has been focused on tokenizing low-dimensional data such as text, audio, and images.</li></ul><p></p><p>The good news is that industry's quest towards both commercializing generative AI and achieving AGI will touch on some of these challenges soon. For example, training domain-specific models using high-quality datasets is an essential component of the small language models I described in the previous section, and these small language models are what will enable privacy-preserving and cost-effective generative AI on laptops and phones. Effectively infinite context windows are also a major hurdle on the path to AGI, as industry is hard at work developing AI agents that can remember every conversation you've ever had with them. Finding more scalable approaches to attention that do not sacrifice accuracy is a part of this.</p><p><b>François Lanusse</b>, currently at the Flatiron Institute, also gave a nice presentation that clearly explained how generative AI can be used to solve inverse problems; that is, figuring out the causes or conditions that resulted in a collection of measurements. 
A precise example he used applied generative AI to figure out what an image distorted by gravitational lensing might look like in the absence of those distortions. As I understood it, he trained a diffusion model to understand the relationship between images that are affected by gravitational lensing and the masses that cause lensing through simulation. He then used that model instead of an oversimplified Gaussian model as part of a larger method to solve the inverse problem of un-distorting the image.</p><p>The details of exactly what he did were a little over my head, but the insight piece for me is that combining generative AI and science in practice is not as straightforward as asking ChatGPT what the undistorted version of a telescope image is. Rather, almost all of the standard, science-informed approach to solving the inverse problem remained the same; the role of generative AI was simply to replace an oversimplified part of the iterative process (the Annealed Hamiltonian Monte Carlo method) to help it converge on better answers. It really is a combination of simulation and AI, rather than an outright substitution or surrogate model.</p><p>Dr. Lanusse also showed this slide which demonstrated how this approach can be generalized to other scientific domains:</p><div class=\"separator\" style=\"clear: both; text-align: center;\"></div><p>The general approach of pretraining, fine-tuning (\"adapt\"), and combining foundation models with other physics-based models seems reasonable, although I admit I have a difficult time wrapping my head around exactly how broadly scoped he envisions any given pretrained foundation model to be. 
I can see such a model trained on extensive sky survey data being useful for a number of astrophysical and cosmological tasks, but it's less clear to me how such a model might be useful in unrelated domains like, say, genomics.</p><p>You might also ask why I think this vision of foundation models for science is reasonable while Rick Stevens' vision didn't ring true; the difference is in scale! The foundation models cited on Lanusse's slide are vision transformers which have many orders of magnitude fewer parameters than the trillion-parameter models that others talk about. Whereas a trillion-parameter model might need to be distributed over dozens of H100 GPUs just to produce one inference result, the largest of the vision transformers can probably be squeezed on to a single high-end desktop GPU. Again, <i>you don't need billion-dollar supercomputers to train these models for science</i>.</p><p><b>Frank Noé</b> from Microsoft Research then talked about how generative AI can be applied to solve problems in simulating biological systems. Like the talk before his, Dr. Noé followed this pattern where a larger, physics-based framework had one statistical technique replaced by a method based on generative AI, and then a physics-based model is used to quantify the likelihood that the result is reasonable. He contrasted this with conventional approaches (to, say, protein folding) where you just simulate for really long times in the hopes that your simulation randomly wanders into a situation where you capture a rare event.</p><p>His talk wasn't about generative AI as much as the previous speakers' talks were, but he offered a litany of ways in which AI models can be useful to molecular modeling:</p><p></p><ul style=\"text-align: left;\"><li><b>Markov state models</b> provide a statistical framework that lets you replace one long simulation (that hopefully captures every possible scenario) with a bunch of short, chopped-up simulations, run in parallel, that hopefully capture every possible scenario. 
He cited an example that took 20,000 GPU-days on V100 GPUs that would've otherwise taken a million GPU-years if done in one long simulation.</li><li><b>Coarse-grained models</b> use machine learning to develop surrogate models to simulate the physics of relatively uninteresting parts of molecular systems. The example he used was simulating the water molecules surrounding a biomolecule; water can be very difficult to accurately model, and the example he cited led to a surrogate model that was 100x faster than directly simulating water molecules.</li><li><b>Boltzmann generators</b> can generate 3D molecular structures based on a known probability distribution defined by the energy states of the system. This is another fast way to find rare but stable molecular configurations without having to throw darts at a dartboard.</li></ul><p></p><p>What struck me is that, in all these cases, the AI model is never generating results that are blindly trusted. Instead, they generate molecular configurations which are then fed into physics-based models which can quantify how likely they are to be valid.</p><p>Both Lanusse's and Noé's examples of combining AI and simulation painted a picture to me where generative AI can be really useful in solving problems where a researcher would otherwise have to make educated guesses about what physical phenomenon is really happening based on incomplete information. So long as there is a way to apply a physics-based model to check the accuracy of each guess, generative AI can be trained to predict the relationships between incomplete information and what's really going on and get to probable answers much faster than relying on physics alone.</p><p>More broadly, I couldn't help but think about the <a href=\"https://www.youtube.com/watch?v=Jfv5XCMj2c0\">Sora video showing pirate ships battling in a cup of coffee</a> as I left this session. 
Like that video, these talks demonstrated that it's possible to train generative AI models to reproduce physical phenomena (like the fluid dynamics of coffee) without explicitly embedding any laws of physics (like the Navier-Stokes equations) into the model itself and still get really compelling results. The part of this that was lacking from the Sora video, but was present in these talks, was closing the loop between generated results and the laws of physics by feeding those generated results back into the laws of physics to figure out if they are probable.</p><h2 id=\"section4\">High Performance Software Foundation</h2><p style=\"text-align: left;\">ISC'24 wasn't all about AI though! I wound up attending the launch of the <a href=\"https://www.hpsf.io/\">High Performance Software Foundation</a> (HPSF), a new Linux Foundation effort spearheaded by Todd Gamblin and Christian Trott (from Livermore and Sandia, respectively) aimed at promoting the sustainability of the software packages relied upon within the high-performance computing community.</p><p style=\"text-align: left;\">I haven't paid close attention to HPC software in a long time since most of my work was in platform architecture and storage systems, so a lot of the background context remains a little murky to me. 
That said, it seems like HPSF was formed to be like the Cloud Native Computing Foundation for the HPC community in that:</p><p></p><ul style=\"text-align: left;\"><li>it will serve as a neutral home for software projects that aren't tied to any single university or government institution</li><li>it provides mechanisms to ensure that critical HPC software can continue to be maintained if its original author gets hit by a bus</li><li>it will help with the marketing and promotion of HPC software</li></ul><p></p><p>Its governance seems pretty reasonable, with different levels of membership being accompanied by different levels of rights and obligations:</p><div class=\"separator\" style=\"clear: both; text-align: center;\"><span style=\"text-align: left;\"> </span></div><p>There is a Governing Board comprised of paying members (and predominantly those who pay the most), while the Technical Advisory Council carries out the more technical tasks of forming working groups and onboarding projects.</p><p>There are three levels of membership, and the highest (premier) has a $175,000 per year buy-in and comes with a seat on the Governing Board. Right now, the founding seats are held by AWS, HPE, LLNL, and Sandia.</p><p>Below that is a general membership tier whose cost is on a sliding scale based on the organization size, and AMD, Intel, NVIDIA, Kitware, ORNL, LANL, and Argonne have all committed at this level. 
The associate tier is below that, and it is free to nonprofits but comes with no voting rights.</p><p>It seemed like the exact functions that HPSF will have beyond this governing structure are not fully baked yet, though there were six \"prospective\" working groups that provide a general scope of what the HPSF will be doing:</p><div class=\"separator\" style=\"clear: both; text-align: center;\"></div><p>My read of the description of these working groups is that</p><p></p><ul style=\"text-align: left;\"><li><b>CI/testing</b> will supply resources (GPUs) on which HPSF projects' code can be automatically tested.</li><li><b>Software stacks</b> will maintain E4S.</li><li><b>User engagement</b> sounds like it will figure out what users of HPSF projects' software are looking for. It sounds like this will provide some product management-like support for projects.</li><li><b>Facility engagement</b> is probably like user engagement, but for the sites deploying code on behalf of their users. Again, this sounds like product management functions.</li><li><b>Security</b> sounded like stewarding SBOM-like stuff for member projects' software.</li><li><b>Benchmarking</b> would make a framework for benchmarking HPC applications.</li></ul><p></p><p>That all said, it still wasn't clear what exactly HPSF would do; what would all those membership dues go towards supporting? Based on some Q&amp;A during this BOF and follow-up afterwards, I pieced together the following:</p><p></p><ul style=\"text-align: left;\"><li>HPSF will <i>not</i> be funding developers, much in the same way that OpenSFS doesn't fund Lustre development. 
That said, <a href=\"https://x.com/tgamblin/status/1790018859816153327\">Todd Gamblin later said</a> that not funding software development was a financial constraint more than a policy one, with the implication that if more members join, there may be opportunity for HPSF to fund projects.</li><li>HPSF likely will be hosting events and conferences (perhaps like the CNCF hosts KubeCon), providing scholarships, developing and providing training related to member projects, and \"increasing collaboration\" (whatever that may mean!).</li></ul><div>HPSF also has some influence and ownership over its member projects:</div><p></p><ul style=\"text-align: left;\"><li>HPSF will co-own its projects' GitHub repos to ensure continuity in case the other repo owner abandons it.</li><li>HPSF will own the domain for the project for the same reasons as above.</li><li>Member projects still manage their own software development, roadmaps, releases, and the like. The HPSF won't dictate the technical direction of projects.</li><li>HPSF will own the trademark and logos of its member projects so it can prevent corporations from profiting off of repackaging products without respecting trademark.</li></ul><p style=\"text-align: left;\">This establishes an interesting new direction for the sorts of software projects that are likely to become member projects. Historically, such projects developed by the member organizations (i.e., DOE labs) have been wholly controlled by the labs that funded the work, and those software projects lived and died at the whims of the government funding. 
The HPSF offers a new vehicle for software projects to live on beyond the end of the grants that created them, but at the same time, it requires that the DOE surrender control of the work that it sponsored.</p><p style=\"text-align: left;\">I left the session still wondering a few pretty major things, likely born out of my own ignorance of how similar organizations (like CNCF or the Apache Foundation) work:</p><p style=\"text-align: left;\"></p><ol style=\"text-align: left;\"><li>How does a software project actually become a member project? The HPSF folks said that the Technical Advisory Council onboards new projects, but what is the bar if I have an open-source project used by the community that I no longer want to maintain myself? I assume it's not a pay-to-play arrangement since that defeats the purpose of sustaining software after its seed funding runs out.</li><li>What do stakeholders actually get out of joining HPSF? I see obvious value for organizations (like the DOE labs) who develop open-source software but may not want to be exclusively responsible for sustaining it forever. But would an HPC facility get any obvious benefit from joining and paying dues if it is simply a consumer of member projects' software? What does a cloud vendor like AWS get by being a premier member? Is HPSF just a way to get someone else to cover the overheads of maintaining <a href=\"https://github.com/awslabs\">open-source software that comes out of, say, R&amp;D organizations</a> rather than product organizations?</li></ol><p></p><p></p><p>Hopefully the answers to these questions become clearer as the foundation gets off the ground and we get to see what member organizations contribute under the HPSF banner.</p><p>Ultimately though, I see this as a really positive direction for the HPC software community that might help resolve some uncertainty around key pieces of HPC software that have uncertain ownership. 
For example, I wound up as a maintainer of the IOR and mdtest benchmark because I was the last one to touch it when its previous maintainer lost interest/funding. I don't even work in I/O performance anymore, but the community still uses this benchmark in virtually every procurement of parallel file systems either directly or through IO500. It would be wonderful if such an important tool didn't rest on my shoulders and had a more concrete governance structure given how important it is.</p><h2 id=\"section5\">Quantum computing</h2><p style=\"text-align: left;\">Besides AI and cloud, quantum computing was cited in Kathy Yelick's opening keynote as the third disruptor to HPC for scientific computing. At the time, I thought citing quantum was just an obligation of any opening keynote speaker, but quantum computing was particularly high-profile at ISC this year. I was surprised to see over a dozen quantum computing companies on the vendor exhibition floor, many of whom were Europe-based startups.</p><p style=\"text-align: left;\">In addition, this year's Hans Meuer award (for best research paper) was given to a paper on quantum computing by Camps et al. This is particularly notable since this is the first time that the Meuer award has ever been given to a paper on a topic that isn't some hardcore traditional HPC like MPI or OpenMP advancements; by comparison, this award has never been given to any papers on AI topics. 
Granted, the winning paper was specifically about how to use conventional HPC to solve quantum problems, but this recognition of research in quantum computing makes a powerful statement: quantum computing research is high-performance computing research.</p><h2 id=\"section6\">Reinvent HPC to include urgent computing?</h2><p style=\"text-align: left;\">I was invited to give a lightning talk at the <a href=\"https://www.interactivehpc.com\">Workshop on Interactive and Urgent High-Performance Computing</a> on Thursday, and urgent/interactive HPC is not something I'd really paid attention to in the past. So as not to sound like an ignorant fool going into that workshop, I opted to sit in on a focus session titled \"Urgent Computing\" on Tuesday. I had two goals:</p><p style=\"text-align: left;\"></p><ol style=\"text-align: left;\"><li>Make sure I understood the HPC problems that fall under urgent and interactive computing so I could hold an intelligent conversation on this topic at the Thursday workshop, and</li><li>See if there are any opportunities for cloud HPC to provide unique value to the challenges faced by folks working in urgent HPC</li></ol><div>I'll describe what I came away with through these lenses.</div><p></p><h3 id=\"section61\">The Urgent Computing focus session</h3><p style=\"text-align: left;\">What I learned from the focus session is that urgent computing is not a very well-defined set of application areas and challenges. Rather, it's another manifestation of reinventing HPC to include any kind of computation for scientific purposes.</p><p style=\"text-align: left;\">Much to my surprise, this \"Urgent Computing\" focus session was actually a session on IoT and edge computing for science. 
Several speakers spoke about getting data from edge sensors on drones or telephone poles into some centralized location for lightweight data analysis, and the \"urgent\" part of the problem came from the hypothetical use cases of analyzing this sensor data to respond to natural disasters. There wasn't much mention of anything requiring HPC-like computing resources; at best, a few talks made unclear references to using AI models for data analysis, but it felt like grasping:</p><div class=\"separator\" style=\"clear: both; text-align: center;\"></div><p style=\"text-align: left;\">The above conclusion slide was presented by one of the speakers, and to be honest, I don't understand what any of it means. Granted, I know very little about urgent computing, IoT, or edge computing so there may be some domain jargon here that's throwing me off. But based on this, as someone working in the area of HPC and AI in the cloud, I don't think I have a role to play here. I'm sure <i>cloud computing</i> can help, but the challenges would be in general-purpose cloud rather than HPC.</p><h3 id=\"section62\">The Interactive and Urgent HPC workshop</h3><p style=\"text-align: left;\">Fortunately for me, the Thursday workshop on Interactive and Urgent HPC was much less about edge/IoT and more about developing software infrastructure and workflows that allow scientific data analysis of large datasets to happen before the results become obsolete. It was a fascinating workshop for learning about specific science drivers that require fast access to HPC resources, and how different HPC providers are enabling that through non-traditional services and policies. Below are a few highlights.</p><p style=\"text-align: left;\"><b>Sam Welborn (NERSC)</b> presented his team's efforts to convert a streaming data workflow from its current file-based approach into one that streamed directly into compute node memory. 
The specific use case was the initial data processing for image information coming off of a scanning transmission electron microscope at 480 Gbps, totaling 750 GB per shot. As he described it, the current technique involves streaming those data to files at the microscope, then copying those files to the parallel file system of a remote supercomputer, then reading, processing, and writing that data within the HPC environment to prepare it for downstream analysis tasks. And for what it's worth, this is how I've always seen \"streaming\" HPC workflows work in practice; they're really using file transfers, and the performance of the file systems at both the source and the destination is in the critical path.</p><p style=\"text-align: left;\">The problem with this approach is that parallel file systems on HPC systems tend to be super flaky, and there's no real reason to bounce data through a storage system if you're just going to pick it up and process it. So, Dr. Welborn showed a true streaming workflow that skipped this file step and used ZeroMQ push sockets at the microscope and pull sockets on the HPC compute nodes to do a direct memory-to-memory transfer:</p><div class=\"separator\" style=\"clear: both; text-align: center;\"></div><p style=\"text-align: left;\">Seeing software like ZeroMQ used to enable communication in an HPC environment instead of forcing this workflow to fit into the MPI paradigm is an encouraging sign in my eyes. ZeroMQ, despite not using purpose-built HPC technology like RDMA, is the right tool for this sort of job since it supports much better resilience characteristics than messaging libraries designed for tightly coupled HPC jobs. 
Workflows like this that combine beefy GPU nodes with software developed in the commercial tech space suggest that the world of HPC is willing to abandon not-invented-here ideology.</p><p style=\"text-align: left;\">It wasn't clear to me that there's a great opportunity for cloud HPC to be uniquely useful in use cases like this; while you certainly can provision beefy CPU and GPU nodes with InfiniBand in Azure, cloud services can't obviously simplify this ZeroMQ-based workflow beyond just supplying general-purpose VMs on which the orchestration services can run. Had this team stuck with a file-based streaming mechanism, the performance SLAs on cloud storage (like object or ephemeral Lustre) would provide a more reliable experience to ensure the data transfer happened in near-real-time. But the better solution to unpredictable file system performance is to do exactly what was done here: skip the file system entirely.</p><p style=\"text-align: left;\">Just to keep the speaker honest, I asked why this computation couldn't simply be done at the same place as the microscope generating the data. After all, if the microscope always generates 750 GB per shot, you should be able to buy a couple GPU servers that are ideally sized to process that exact workload in the time between images. There were actually two answers, one from Sam and one from an audience member:</p><p style=\"text-align: left;\"></p><ol style=\"text-align: left;\"><li>Sam said that you can process this workflow locally, but that the goal of this work was to prepare for a future microscope (or another instrument) whose data could not be processed locally. He also insightfully pointed out that there's tremendous value in getting the data into the HPC environment because of all the services that can be used to work on that data later. 
I envisioned doing things like using a Jupyter notebook to further process the data, serve it up through a web UI, and similar tasks that cannot be done if the data is stuck inside a microscope room.</li><li>An audience member also pointed out that sticking GPU nodes in the same room as electron microscopes can result in enough noise and vibration to disrupt the actual scope. This was a great point! In the days before I started working in HPC, I was training to become an electron microscopist, and I worked in a lab where we had <a href=\"https://ifmd.lehigh.edu/research-stem\">water-cooled walls</a> to avoid the problems that would be caused by air conditioning breezes. There's no way a loud server would've worked in there.</li></ol><p style=\"text-align: left;\"><b>Toshio Endo (Tokyo Tech)</b> gave an interesting talk on how they enable urgent/interactive compute jobs on their batch-scheduled TSUBAME4.0 supercomputer by doing, frankly, unnatural things. Rather than holding aside some nodes for interactive use as is common practice, his work found that a lot of user jobs do not completely use all resources on each compute node they reserve:</p><div class=\"separator\" style=\"clear: both; text-align: center;\"></div><p style=\"text-align: left;\">I had to do a double-take when I saw this: even though 65%-80% of the nodes on the supercomputer were allocated to user jobs, less than 7% of the GPUs were actually being utilized.</p><p style=\"text-align: left;\">Dr. Endo's hypothesis was that if nodes were suitably subdivided and jobs were allowed to oversubscribe CPUs, GPUs, and memory on a compute node without impacting performance too much, they could deliver real-time access to HPC resources without having to create a separate pool of nodes only for interactive uses. 
He defined success as each shared job running at no less than 1/k of its dedicated speed when k jobs share the same node; for example, if four jobs were all running on the same node, each one taking four times as long to complete would be acceptable, but any longer would not. He then went on to show that the best way to accomplish this is using <a href=\"https://slurm.schedmd.com/gang_scheduling.html\">Slurm's gang scheduling</a>, where each job takes turns having exclusive access to all the CPUs and GPUs on a node. The alternative (just letting the OS context switch) was no good.</p><p style=\"text-align: left;\">While a fascinating study in how to provide zero wait time to jobs in exchange for reduced performance, this whole mechanism of using gang scheduling to exploit low resource utilization seems like jamming a square peg into a round hole. If a workload doesn't (or can't) use all the GPUs on a node, then that's not the right node for the job; I feel like a more appealing solution would simply be to offer a heterogeneous mix of nodes based on the demands of the workload mix. This is hard to do if you're buying monolithic supercomputers since you're stuck with whatever node mix you've got for five years, but there is another way to buy supercomputers!</p><p style=\"text-align: left;\">I won't pretend like dynamically provisioning different flavors of CPU- and GPU-based nodes interconnected with InfiniBand in the cloud doesn't come with a cost; the convenience of being able to slosh a cluster makeup between CPU-heavy and GPU-heavy nodes will be more expensive than committing to use the same makeup of node flavors for multiple years. 
But if you're paying for GPUs that are only being used 7% of the time, surely it's cheaper to pay a higher cost for GPUs when you need them if it also allows you to not pay for them 93% of the time when they're idle.</p><p style=\"text-align: left;\">Bjoern Enders (NERSC) gave the first lightning talk where he presented the exploration they're making into enabling real-time and urgent computation. They're currently going in three parallel directions to provide this capability:</p><p style=\"text-align: left;\"></p><ol style=\"text-align: left;\"><li>Reservations, a process by which a user can request a specific number of nodes for a specific period of time, and Slurm ensures that many nodes are available for the exclusive use of that user by the time the reservation starts. He said that implementing this at NERSC is costly and rigid because it requires a human administrator to perform manual steps to register the reservation with Slurm. </li><li>Realtime queues, where a few nodes are held from the regular batch queue and only special real-time users can submit jobs to them. Dr. Enders said that NERSC is extremely selective about who can access this queue for obvious reasons: if too many people use it, it will back up just like the regular batch queues do.</li><li>Jupyter Hub, which utilizes job preemption and backfill under the hood. If a user requests a Jupyter job, Slurm will pre-empt a job that was submitted to a preemptible queue to satisfy the Jupyter request. However, if there are no preemptible jobs running, the Jupyter job will fail to launch after waiting for ten minutes.</li></ol><p style=\"text-align: left;\">To provide compute resources to back up these scheduling capabilities, they also deployed a new set of compute nodes that can be dynamically attached to different supercomputers they have to support urgent workloads even during downtimes.  
Called \"Perlmutter on Demand\" (POD), it sounded like a separate set of Cray EX racks that can be assigned to either the Perlmutter supercomputer or, if Perlmutter is down for maintenance, their smaller Alvarez or Muller supercomputers, which share the same Cray EX architecture. What wasn't clear to me is how the Slingshot fabrics of these nodes interact; perhaps POD has its own fabric, and only the control plane owning those racks is what changes.</p><p style=\"text-align: left;\">He showed a slide of explorations they're doing with this POD infrastructure, but as with Dr. Endo's talk, this seemed a bit like a square peg in a round hole:</p><div class=\"separator\" style=\"clear: both; text-align: center;\"></div><p style=\"text-align: left;\">All of this sounds aligned with the strengths of what HPC in a cloud environment can deliver, and some of the big challenges (like figuring out the ideal node count to reserve for interactive jobs) are problems specific to Slurm and its mechanism for scheduling. There's a lot more flexibility to rapidly provision HPC resources in cloud environments because, unlike the case where Slurm is scheduling jobs on a single cluster, cloud resource managers can schedule across any number of clusters independently. For example, if an urgent workload needing only four GPU nodes suddenly appears, it doesn't necessarily have to be scheduled on the same InfiniBand fabric that a large hero job is running on. Since the urgent job and the hero job don't need to talk to each other, cloud resource managers can go find a GPU cluster with a little more flex in it to provision those resources quickly.</p><p style=\"text-align: left;\">Automating the process of reservations is also a bit of a game of catch-up, though my guess is that this is more a matter of someone having a weekend to sit down and write the REST service that manages incoming reservation requests. 
Although there's not a direct analog for reservations like this in Azure, AWS has a feature called <a href=\"https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-capacity-blocks.html\">AWS Capacity Blocks</a> that does exactly this: if you know you'll want a certain number of GPU nodes sometime in the future, Capacity Blocks let you reserve them ahead of time through an API.</p><p></p><p></p><p>Finally, <b>I represented Microsoft</b> and gave a lightning talk that riffed on a lot of what I've been writing about in this blog post: HPC seems to be reinventing a lot of things that the cloud has already figured out how to do. The illustrious Nick Brown was kind enough to <a href=\"https://twitter.com/nickbrownhpc/status/1791129551218590207?s=21&amp;t=7LM0hNWEuk95n8Z_CNZvPg\">snap a photo of one of my slides and post it on Twitter</a>:</p><div class=\"separator\" style=\"clear: both; text-align: center;\"></div><p>My thesis was that the way urgent HPC workflows are triggered, scheduled, run, and reported on follows the same pattern that inferencing-as-a-service services (like Copilot and ChatGPT) are implemented under the hood, right down to executing multi-node jobs on InfiniBand clusters. The difference is that these cloud workflows are built on the foundation of really nice cloud services that provide security, scalability, monitoring, and hands-free management that were originally developed for commercial (not HPC!) customers. My argument was that, even if you don't want to pay cloud providers to run urgent HPC workflows as a managed service, you can use these services (and the software infrastructure on which they're built) as a blueprint for how to build these capabilities in your own HPC environments.</p><h2 id=\"section7\">Concluding thoughts</h2><p>The ISC'24 conference was fantastic, and I am glad it has not lost the unique elements that made me want to attend in the years prior to the pandemic. 
It's still that smaller, intimate, and focused HPC conference that brings the community together. Although a lot of my synopsis above may sound critical of the content presented over the four days I attended, the fact that I've had so much to write down in this blog post is a testament to the value I really get out of attending: it makes me sit down and think critically about the way the HPC community is evolving, what the leading minds in the field are thinking, and where I might be able to contribute the most in the coming year.</p><p>I never much paid attention to the annual taglines of conferences like ISC, but this year's \"Reinvent HPC\" really resonated. The HPC community is at a crossroads. Exascale computing for science is now in the rear-view mirror, and large-scale AI is all the rage across the computing industry at large. But for the first time ever, this new direction in at-scale computing is happening without the inclusion of the people and organizations who've historically driven innovation in HPC. Whereas institutions like Oak Ridge, RIKEN, Cray, and Fujitsu defined the future of computing for decades, hundred-person startups like OpenAI and Anthropic are now paving the way in partnership with companies like Microsoft and Amazon.</p><p>HPC needs to be reinvented, if for no other reason than to decide whether the HPC community wants to be inclusive of new frontiers in computing that they do not lead. Does the HPC community want AI to be considered a part of HPC?</p><p>Judging from many speakers and panelists, the answer may be \"no.\" To many, it sounded like AI is just another industry that's sucking all the air (and GPUs) out of the room; it's a distraction that is pulling funding and public interest away from solving real problems. It's not something worth understanding, it's not something that uses the familiar tools and libraries, and it's not the product of decades of steady, government-funded improvements. 
AI is \"them\" and HPC is \"us.\"</p><p>Personally, I'd like the answer to be \"yes\" though. Now that I'm on the other side of the table, supporting AI for a cloud provider, I can say that the technical challenges I face at Microsoft are the same technical challenges I faced in the DOE. The desire to deeply understand systems, optimize applications, and put world-class computing infrastructure in the hands of people who do amazing things is the same. And as the days go by, many of the faces I see are the same; instead of wearing DOE or Cray badges, my lifelong colleagues are now wearing NVIDIA or Microsoft badges.</p><p>All this applies equally to whether cloud is HPC or not. The HPC community needs to reinvent itself to be inclusive of <i>everyone</i> working towards solving the same problems of computing at scale. Stop talking about people who work on commercial AI in cloud-based supercomputers as if they aren't in the room. They are in the room. Often near the front row, snapping photos, and angrily posting commentary on Twitter about how you're getting it all wrong.</p><div class=\"separator\" style=\"clear: both; text-align: center;\"></div><p style=\"text-align: left;\">HPC has historically been used to solve scientific problems, whether to expand our understanding of the universe, to find the next best place to drill an oil well, or to model the safety of aging nuclear weapons. The fact that HPC is now being used to solve squishier problems related to natural language or image generation does not change the essence of HPC. And whether that HPC is delivered through physical nodes and networks or virtualized nodes and networks is irrelevant, as long as those resources are still delivering high performance. 
AI is just as much HPC as scientific computing is, and cloud is just as much HPC as OLCF, R-CCS, or CSCS is.</p><p style=\"text-align: left;\">So perhaps HPC doesn't need to be reinvented as much as the mindset of its community does.</p><p style=\"text-align: left;\">That all said, I am genuinely impressed by how quickly ISC has been reinventing itself in recent years. It wasn't too long ago that all its keynote speakers were greybeards from a predictable pool of public HPC centers all saying the same things year after year. It's wonderful to see a greater diversity of perspectives on the main stage and torches passing on to the next generation of leading figures in the field. And it was not lost on me that, for the first time in the history of this conference, Thomas Sterling did not deliver the closing keynote. As much fun as I had poking fun at his meandering and often-off-the-mark conjectures every year, it was delightful to be exposed to something new this year.</p><p style=\"text-align: left;\">I'm hopeful that ISC will continue to get better year over year, and ISC'25 will feel more inclusive of me despite the fact that I am now one of those hyperscale cloud AI people. So long as I still feel like it's my community, though, I will keep showing up in Germany every summer.</p>",
            "url": "https://hpc.social/personal-blog/2024/isc-24-recap/",
            
            
            
            
            
            "date_published": "2024-05-28T05:24:00-06:00",
            "date_modified": "2024-05-28T05:24:00-06:00",
            
                "author": "Glenn K. Lockwood's Blog"
            
        },
    
        {
            "id": "https://hpc.social/personal-blog/2024/centralized-system-and-lsf-logging-on-a-turing-pi-system/",
            "title": "Centralized system and LSF logging on a Turing Pi system",
            "summary": null,
            "content_text": "Logs are one of those indispensable things in IT when things go wrong. Having worked in technical support for software products in a past life, I’ve likely looked at hundreds (or more) logs over the years, helping to identify issues. So, I really appreciate the importance of logs, but I can honestly say that I never really thought about a logging strategy for the systems on my home network - primarily those running Linux.One of my longtime friends, Peter Czanik, who also works in IT, happens to be a logging guru as well as an IBM Champion for Power Systems (yeah!). So it’s only natural that we get to talking about logging. He is often complaining that even at IT security conferences people are unaware of the importance of central logging. So, why is it so important? For security it’s obvious: logs are stored independently from the compromised system, so they cannot be modified or deleted by the attacker. But central logging is beneficial for the HPC operator as well. First of all, it’s availability. You can read the logs even if one of your nodes becomes unreachable. Instead of trying to breath life into the failed node, you can just take a look at the logs and see a broken hard drive, or a similar deadly problem. And it is also convenience, as all logs are available at a single location. Logging into each node on the 3 node cluster to check locally saved logs is inconvenient but doable. On a 10 node cluster it takes a long time. On a 100 node cluster a couple of working days. While, if your logs are collected to a central location, maybe a single grep command, or search in a Kibana or similar web interface.Those who follow my blog will know that I’ve been tinkering with a Turing Pi V1 system lately. You can read my latest post here. For me, the Turing Pi has always been a cluster in a box. My Turing Pi is fully populated with 7 compute modules. I’ve designed Node 1 to be the NFS server and LSF manager for the cluster. 
LSF is a workload scheduler for high-performance computing (HPC) from IBM. Naturally I turned to Peter for his guidance on this, and the result is this blog. Peter recommended that I use syslog-ng for log aggregation and also helped me through some of my first steps with syslog-ng. The goal was to aggregate both the system (syslog) and LSF logs on Node 1. TL;DR: it was easy to get it all working. But I encourage you to read on to better understand the nuances and the configuration that both syslog-ng and LSF needed.\n\nThe environment\n\nThe following software has been deployed on the Turing Pi:\n- Raspberry Pi OS (2023-02-21-raspios-bullseye-arm64-lite.img)\n- syslog-ng 3 – (3.28.1 as supplied with Raspberry Pi OS)\n- IBM LSF Standard Edition V10.1.0.13\n\nThe Turing Pi system is configured as follows. Node 1 (turingpi) is the manager node of this cluster in a box and has by far the most storage. Naturally we want to use that as the centralized logging server.\n\nNode | Hostname | Hardware | Notes\n1 | turingpi | CM3+ | LSF manager, NFS server, 128GB SDcard\n2 | kemeny | CM3 | 4GB eMMC flash\n3 | neumann | CM3+ | 8GB SDcard\n4 | szilard | CM3+ | 8GB SDcard\n5 | teller | CM3+ | 8GB SDcard\n6 | vonkarman | CM3+ | 8GB SDcard\n7 | wigner | CM3+ | 8GB SDcard\n\nSyslog-ng & LSF setup\n\nRaspberry Pi OS configures rsyslog out of the box. The first step is to install syslog-ng on Node 1 in the environment. Note that installing syslog-ng automatically removes rsyslog from the node.  Output of apt update; apt-get install syslog-ng -y. 
root@turingpi:~# apt update; apt-get install syslog-ng -y Hit:1 http://security.debian.org/debian-security bullseye-security InReleaseHit:2 http://deb.debian.org/debian bullseye InRelease                                                        Hit:3 http://deb.debian.org/debian bullseye-updates InRelease                                                Hit:4 https://repos.influxdata.com/debian stable InRelease                                                   Hit:5 https://repos.influxdata.com/debian bullseye InRelease                                                 Hit:6 http://archive.raspberrypi.org/debian bullseye InRelease                                  Hit:7 https://packagecloud.io/ookla/speedtest-cli/debian bullseye InRelease                     Reading package lists... DoneBuilding dependency tree... DoneReading state information... DoneAll packages are up to date.Reading package lists... DoneBuilding dependency tree... DoneReading state information... DoneThe following additional packages will be installed:  libbson-1.0-0 libdbi1 libesmtp6 libhiredis0.14 libivykis0 libmaxminddb0 libmongoc-1.0-0 libmongocrypt0  libnet1 libprotobuf-c1 librabbitmq4 librdkafka1 libriemann-client0 libsnappy1v5 libsnmp-base libsnmp40  syslog-ng-core syslog-ng-mod-add-contextual-data syslog-ng-mod-amqp syslog-ng-mod-examples  syslog-ng-mod-extra syslog-ng-mod-geoip2 syslog-ng-mod-getent syslog-ng-mod-graphite syslog-ng-mod-http  syslog-ng-mod-map-value-pairs syslog-ng-mod-mongodb syslog-ng-mod-python syslog-ng-mod-rdkafka  syslog-ng-mod-redis syslog-ng-mod-riemann syslog-ng-mod-slog syslog-ng-mod-smtp syslog-ng-mod-snmp  syslog-ng-mod-sql syslog-ng-mod-stardate syslog-ng-mod-stomp syslog-ng-mod-xml-parserSuggested packages:  mmdb-bin snmp-mibs-downloader rabbitmq-server graphite-web mongodb-server libdbd-mysql libdbd-pgsql  libdbd-sqlite3 activemqThe following packages will be REMOVED:  rsyslogThe following NEW packages will be installed:  libbson-1.0-0 libdbi1 
libesmtp6 libhiredis0.14 libivykis0 libmaxminddb0 libmongoc-1.0-0 libmongocrypt0  libnet1 libprotobuf-c1 librabbitmq4 librdkafka1 libriemann-client0 libsnappy1v5 libsnmp-base libsnmp40  syslog-ng syslog-ng-core syslog-ng-mod-add-contextual-data syslog-ng-mod-amqp syslog-ng-mod-examples  syslog-ng-mod-extra syslog-ng-mod-geoip2 syslog-ng-mod-getent syslog-ng-mod-graphite syslog-ng-mod-http  syslog-ng-mod-map-value-pairs syslog-ng-mod-mongodb syslog-ng-mod-python syslog-ng-mod-rdkafka  syslog-ng-mod-redis syslog-ng-mod-riemann syslog-ng-mod-slog syslog-ng-mod-smtp syslog-ng-mod-snmp  syslog-ng-mod-sql syslog-ng-mod-stardate syslog-ng-mod-stomp syslog-ng-mod-xml-parser0 upgraded, 39 newly installed, 1 to remove and 0 not upgraded.Need to get 7,015 kB of archives.After this operation, 15.1 MB of additional disk space will be used.Get:1 http://deb.debian.org/debian bullseye/main arm64 libbson-1.0-0 arm64 1.17.6-1 [69.7 kB]Get:2 http://deb.debian.org/debian bullseye/main arm64 libmongocrypt0 arm64 1.1.0-1 [114 kB]Get:3 http://deb.debian.org/debian bullseye/main arm64 libsnappy1v5 arm64 1.1.8-1 [17.2 kB]Get:4 http://deb.debian.org/debian bullseye/main arm64 libmongoc-1.0-0 arm64 1.17.6-1 [257 kB]Get:5 http://deb.debian.org/debian bullseye/main arm64 libivykis0 arm64 0.42.4-1 [25.3 kB]Get:6 http://deb.debian.org/debian bullseye/main arm64 libnet1 arm64 1.1.6+dfsg-3.1 [56.8 kB]Get:7 http://deb.debian.org/debian bullseye/main arm64 syslog-ng-core arm64 3.28.1-2+deb11u1 [591 kB]Get:8 http://deb.debian.org/debian bullseye/main arm64 syslog-ng-mod-mongodb arm64 3.28.1-2+deb11u1 [37.9 kB]Get:9 http://deb.debian.org/debian bullseye/main arm64 libdbi1 arm64 0.9.0-6 [27.8 kB]Get:10 http://deb.debian.org/debian bullseye/main arm64 syslog-ng-mod-sql arm64 3.28.1-2+deb11u1 [41.5 kB]Get:11 http://deb.debian.org/debian bullseye/main arm64 libesmtp6 arm64 1.0.6-4.3 [52.0 kB]Get:12 http://deb.debian.org/debian bullseye/main arm64 libhiredis0.14 arm64 0.14.1-1 [33.7 kB]Get:13 
http://deb.debian.org/debian bullseye/main arm64 libmaxminddb0 arm64 1.5.2-1 [29.6 kB]Get:14 http://deb.debian.org/debian bullseye/main arm64 libprotobuf-c1 arm64 1.3.3-1+b2 [26.8 kB]Get:15 http://deb.debian.org/debian bullseye/main arm64 librabbitmq4 arm64 0.10.0-1 [39.7 kB]Get:16 http://deb.debian.org/debian bullseye/main arm64 librdkafka1 arm64 1.6.0-1 [515 kB]Get:17 http://deb.debian.org/debian bullseye/main arm64 libriemann-client0 arm64 1.10.4-2+b2 [21.9 kB]Get:18 http://deb.debian.org/debian bullseye/main arm64 libsnmp-base all 5.9+dfsg-4+deb11u1 [1,736 kB]Get:19 http://deb.debian.org/debian bullseye/main arm64 libsnmp40 arm64 5.9+dfsg-4+deb11u1 [2,497 kB]Get:20 http://deb.debian.org/debian bullseye/main arm64 syslog-ng all 3.28.1-2+deb11u1 [25.9 kB]Get:21 http://deb.debian.org/debian bullseye/main arm64 syslog-ng-mod-add-contextual-data arm64 3.28.1-2+deb11u1 [40.5 kB]Get:22 http://deb.debian.org/debian bullseye/main arm64 syslog-ng-mod-amqp arm64 3.28.1-2+deb11u1 [48.8 kB]Get:23 http://deb.debian.org/debian bullseye/main arm64 syslog-ng-mod-examples arm64 3.28.1-2+deb11u1 [57.3 kB]Get:24 http://deb.debian.org/debian bullseye/main arm64 syslog-ng-mod-extra all 3.28.1-2+deb11u1 [35.7 kB]Get:25 http://deb.debian.org/debian bullseye/main arm64 syslog-ng-mod-geoip2 arm64 3.28.1-2+deb11u1 [36.9 kB]Get:26 http://deb.debian.org/debian bullseye/main arm64 syslog-ng-mod-graphite arm64 3.28.1-2+deb11u1 [29.4 kB]Get:27 http://deb.debian.org/debian bullseye/main arm64 syslog-ng-mod-http arm64 3.28.1-2+deb11u1 [50.5 kB]Get:28 http://deb.debian.org/debian bullseye/main arm64 syslog-ng-mod-python arm64 3.28.1-2+deb11u1 [69.9 kB]Get:29 http://deb.debian.org/debian bullseye/main arm64 syslog-ng-mod-rdkafka arm64 3.28.1-2+deb11u1 [41.5 kB]Get:30 http://deb.debian.org/debian bullseye/main arm64 syslog-ng-mod-redis arm64 3.28.1-2+deb11u1 [37.6 kB]Get:31 http://deb.debian.org/debian bullseye/main arm64 syslog-ng-mod-riemann arm64 3.28.1-2+deb11u1 [40.1 kB]Get:32 
http://deb.debian.org/debian bullseye/main arm64 syslog-ng-mod-slog arm64 3.28.1-2+deb11u1 [63.3 kB]Get:33 http://deb.debian.org/debian bullseye/main arm64 syslog-ng-mod-smtp arm64 3.28.1-2+deb11u1 [38.0 kB]Get:34 http://deb.debian.org/debian bullseye/main arm64 syslog-ng-mod-snmp arm64 3.28.1-2+deb11u1 [42.5 kB]Get:35 http://deb.debian.org/debian bullseye/main arm64 syslog-ng-mod-stomp arm64 3.28.1-2+deb11u1 [39.1 kB]Get:36 http://deb.debian.org/debian bullseye/main arm64 syslog-ng-mod-xml-parser arm64 3.28.1-2+deb11u1 [34.7 kB]Get:37 http://deb.debian.org/debian bullseye/main arm64 syslog-ng-mod-getent arm64 3.28.1-2+deb11u1 [29.5 kB]Get:38 http://deb.debian.org/debian bullseye/main arm64 syslog-ng-mod-map-value-pairs arm64 3.28.1-2+deb11u1 [34.0 kB]Get:39 http://deb.debian.org/debian bullseye/main arm64 syslog-ng-mod-stardate arm64 3.28.1-2+deb11u1 [28.6 kB]Fetched 7,015 kB in 5s (1,311 kB/s)           Extracting templates from packages: 100%(Reading database ... 90182 files and directories currently installed.)Removing rsyslog (8.2102.0-2+deb11u1) ...Selecting previously unselected package libbson-1.0-0.(Reading database ... 
90124 files and directories currently installed.)Preparing to unpack .../00-libbson-1.0-0_1.17.6-1_arm64.deb ...Unpacking libbson-1.0-0 (1.17.6-1) ...Selecting previously unselected package libmongocrypt0:arm64.Preparing to unpack .../01-libmongocrypt0_1.1.0-1_arm64.deb ...Unpacking libmongocrypt0:arm64 (1.1.0-1) ...Selecting previously unselected package libsnappy1v5:arm64.Preparing to unpack .../02-libsnappy1v5_1.1.8-1_arm64.deb ...Unpacking libsnappy1v5:arm64 (1.1.8-1) ...Selecting previously unselected package libmongoc-1.0-0.Preparing to unpack .../03-libmongoc-1.0-0_1.17.6-1_arm64.deb ...Unpacking libmongoc-1.0-0 (1.17.6-1) ...Selecting previously unselected package libivykis0:arm64.Preparing to unpack .../04-libivykis0_0.42.4-1_arm64.deb ...Unpacking libivykis0:arm64 (0.42.4-1) ...Selecting previously unselected package libnet1:arm64.Preparing to unpack .../05-libnet1_1.1.6+dfsg-3.1_arm64.deb ...Unpacking libnet1:arm64 (1.1.6+dfsg-3.1) ...Selecting previously unselected package syslog-ng-core.Preparing to unpack .../06-syslog-ng-core_3.28.1-2+deb11u1_arm64.deb ...Unpacking syslog-ng-core (3.28.1-2+deb11u1) ...Selecting previously unselected package syslog-ng-mod-mongodb.Preparing to unpack .../07-syslog-ng-mod-mongodb_3.28.1-2+deb11u1_arm64.deb ...Unpacking syslog-ng-mod-mongodb (3.28.1-2+deb11u1) ...Selecting previously unselected package libdbi1:arm64.Preparing to unpack .../08-libdbi1_0.9.0-6_arm64.deb ...Unpacking libdbi1:arm64 (0.9.0-6) ...Selecting previously unselected package syslog-ng-mod-sql.Preparing to unpack .../09-syslog-ng-mod-sql_3.28.1-2+deb11u1_arm64.deb ...Unpacking syslog-ng-mod-sql (3.28.1-2+deb11u1) ...Selecting previously unselected package libesmtp6.Preparing to unpack .../10-libesmtp6_1.0.6-4.3_arm64.deb ...Unpacking libesmtp6 (1.0.6-4.3) ...Selecting previously unselected package libhiredis0.14:arm64.Preparing to unpack .../11-libhiredis0.14_0.14.1-1_arm64.deb ...Unpacking libhiredis0.14:arm64 (0.14.1-1) ...Selecting previously 
unselected package libmaxminddb0:arm64.Preparing to unpack .../12-libmaxminddb0_1.5.2-1_arm64.deb ...Unpacking libmaxminddb0:arm64 (1.5.2-1) ...Selecting previously unselected package libprotobuf-c1:arm64.Preparing to unpack .../13-libprotobuf-c1_1.3.3-1+b2_arm64.deb ...Unpacking libprotobuf-c1:arm64 (1.3.3-1+b2) ...Selecting previously unselected package librabbitmq4:arm64.Preparing to unpack .../14-librabbitmq4_0.10.0-1_arm64.deb ...Unpacking librabbitmq4:arm64 (0.10.0-1) ...Selecting previously unselected package librdkafka1:arm64.Preparing to unpack .../15-librdkafka1_1.6.0-1_arm64.deb ...Unpacking librdkafka1:arm64 (1.6.0-1) ...Selecting previously unselected package libriemann-client0:arm64.Preparing to unpack .../16-libriemann-client0_1.10.4-2+b2_arm64.deb ...Unpacking libriemann-client0:arm64 (1.10.4-2+b2) ...Selecting previously unselected package libsnmp-base.Preparing to unpack .../17-libsnmp-base_5.9+dfsg-4+deb11u1_all.deb ...Unpacking libsnmp-base (5.9+dfsg-4+deb11u1) ...Selecting previously unselected package libsnmp40:arm64.Preparing to unpack .../18-libsnmp40_5.9+dfsg-4+deb11u1_arm64.deb ...Unpacking libsnmp40:arm64 (5.9+dfsg-4+deb11u1) ...Selecting previously unselected package syslog-ng.Preparing to unpack .../19-syslog-ng_3.28.1-2+deb11u1_all.deb ...Unpacking syslog-ng (3.28.1-2+deb11u1) ...Selecting previously unselected package syslog-ng-mod-add-contextual-data.Preparing to unpack .../20-syslog-ng-mod-add-contextual-data_3.28.1-2+deb11u1_arm64.deb ...Unpacking syslog-ng-mod-add-contextual-data (3.28.1-2+deb11u1) ...Selecting previously unselected package syslog-ng-mod-amqp.Preparing to unpack .../21-syslog-ng-mod-amqp_3.28.1-2+deb11u1_arm64.deb ...Unpacking syslog-ng-mod-amqp (3.28.1-2+deb11u1) ...Selecting previously unselected package syslog-ng-mod-examples.Preparing to unpack .../22-syslog-ng-mod-examples_3.28.1-2+deb11u1_arm64.deb ...Unpacking syslog-ng-mod-examples (3.28.1-2+deb11u1) ...Selecting previously unselected package 
syslog-ng-mod-extra.Preparing to unpack .../23-syslog-ng-mod-extra_3.28.1-2+deb11u1_all.deb ...Unpacking syslog-ng-mod-extra (3.28.1-2+deb11u1) ...Selecting previously unselected package syslog-ng-mod-geoip2.Preparing to unpack .../24-syslog-ng-mod-geoip2_3.28.1-2+deb11u1_arm64.deb ...Unpacking syslog-ng-mod-geoip2 (3.28.1-2+deb11u1) ...Selecting previously unselected package syslog-ng-mod-graphite.Preparing to unpack .../25-syslog-ng-mod-graphite_3.28.1-2+deb11u1_arm64.deb ...Unpacking syslog-ng-mod-graphite (3.28.1-2+deb11u1) ...Selecting previously unselected package syslog-ng-mod-http.Preparing to unpack .../26-syslog-ng-mod-http_3.28.1-2+deb11u1_arm64.deb ...Unpacking syslog-ng-mod-http (3.28.1-2+deb11u1) ...Selecting previously unselected package syslog-ng-mod-python.Preparing to unpack .../27-syslog-ng-mod-python_3.28.1-2+deb11u1_arm64.deb ...Unpacking syslog-ng-mod-python (3.28.1-2+deb11u1) ...Selecting previously unselected package syslog-ng-mod-rdkafka.Preparing to unpack .../28-syslog-ng-mod-rdkafka_3.28.1-2+deb11u1_arm64.deb ...Unpacking syslog-ng-mod-rdkafka (3.28.1-2+deb11u1) ...Selecting previously unselected package syslog-ng-mod-redis.Preparing to unpack .../29-syslog-ng-mod-redis_3.28.1-2+deb11u1_arm64.deb ...Unpacking syslog-ng-mod-redis (3.28.1-2+deb11u1) ...Selecting previously unselected package syslog-ng-mod-riemann.Preparing to unpack .../30-syslog-ng-mod-riemann_3.28.1-2+deb11u1_arm64.deb ...Unpacking syslog-ng-mod-riemann (3.28.1-2+deb11u1) ...Selecting previously unselected package syslog-ng-mod-slog.Preparing to unpack .../31-syslog-ng-mod-slog_3.28.1-2+deb11u1_arm64.deb ...Unpacking syslog-ng-mod-slog (3.28.1-2+deb11u1) ...Selecting previously unselected package syslog-ng-mod-smtp.Preparing to unpack .../32-syslog-ng-mod-smtp_3.28.1-2+deb11u1_arm64.deb ...Unpacking syslog-ng-mod-smtp (3.28.1-2+deb11u1) ...Selecting previously unselected package syslog-ng-mod-snmp.Preparing to unpack .../33-syslog-ng-mod-snmp_3.28.1-2+deb11u1_arm64.deb 
...Unpacking syslog-ng-mod-snmp (3.28.1-2+deb11u1) ...Selecting previously unselected package syslog-ng-mod-stomp.Preparing to unpack .../34-syslog-ng-mod-stomp_3.28.1-2+deb11u1_arm64.deb ...Unpacking syslog-ng-mod-stomp (3.28.1-2+deb11u1) ...Selecting previously unselected package syslog-ng-mod-xml-parser.Preparing to unpack .../35-syslog-ng-mod-xml-parser_3.28.1-2+deb11u1_arm64.deb ...Unpacking syslog-ng-mod-xml-parser (3.28.1-2+deb11u1) ...Selecting previously unselected package syslog-ng-mod-getent.Preparing to unpack .../36-syslog-ng-mod-getent_3.28.1-2+deb11u1_arm64.deb ...Unpacking syslog-ng-mod-getent (3.28.1-2+deb11u1) ...Selecting previously unselected package syslog-ng-mod-map-value-pairs.Preparing to unpack .../37-syslog-ng-mod-map-value-pairs_3.28.1-2+deb11u1_arm64.deb ...Unpacking syslog-ng-mod-map-value-pairs (3.28.1-2+deb11u1) ...Selecting previously unselected package syslog-ng-mod-stardate.Preparing to unpack .../38-syslog-ng-mod-stardate_3.28.1-2+deb11u1_arm64.deb ...Unpacking syslog-ng-mod-stardate (3.28.1-2+deb11u1) ...Setting up librabbitmq4:arm64 (0.10.0-1) ...Setting up libdbi1:arm64 (0.9.0-6) ...Setting up libsnmp-base (5.9+dfsg-4+deb11u1) ...Setting up libmaxminddb0:arm64 (1.5.2-1) ...Setting up libesmtp6 (1.0.6-4.3) ...Setting up libnet1:arm64 (1.1.6+dfsg-3.1) ...Setting up libprotobuf-c1:arm64 (1.3.3-1+b2) ...Setting up libsnappy1v5:arm64 (1.1.8-1) ...Setting up libsnmp40:arm64 (5.9+dfsg-4+deb11u1) ...Setting up libbson-1.0-0 (1.17.6-1) ...Setting up libivykis0:arm64 (0.42.4-1) ...Setting up libriemann-client0:arm64 (1.10.4-2+b2) ...Setting up librdkafka1:arm64 (1.6.0-1) ...Setting up libhiredis0.14:arm64 (0.14.1-1) ...Setting up libmongocrypt0:arm64 (1.1.0-1) ...Setting up libmongoc-1.0-0 (1.17.6-1) ...Setting up syslog-ng-core (3.28.1-2+deb11u1) ...Created symlink /etc/systemd/system/multi-user.target.wants/syslog-ng.service → /lib/systemd/system/syslog-ng.service.Setting up syslog-ng-mod-examples (3.28.1-2+deb11u1) ...Setting up 
syslog-ng-mod-xml-parser (3.28.1-2+deb11u1) ...Setting up syslog-ng-mod-stomp (3.28.1-2+deb11u1) ...Setting up syslog-ng-mod-riemann (3.28.1-2+deb11u1) ...Setting up syslog-ng-mod-stardate (3.28.1-2+deb11u1) ...Setting up syslog-ng-mod-geoip2 (3.28.1-2+deb11u1) ...Setting up syslog-ng-mod-getent (3.28.1-2+deb11u1) ...Setting up syslog-ng-mod-amqp (3.28.1-2+deb11u1) ...Setting up syslog-ng-mod-python (3.28.1-2+deb11u1) ...Setting up syslog-ng-mod-smtp (3.28.1-2+deb11u1) ...Setting up syslog-ng-mod-snmp (3.28.1-2+deb11u1) ...Setting up syslog-ng-mod-extra (3.28.1-2+deb11u1) ...Setting up syslog-ng-mod-rdkafka (3.28.1-2+deb11u1) ...Setting up syslog-ng-mod-graphite (3.28.1-2+deb11u1) ...Setting up syslog-ng-mod-add-contextual-data (3.28.1-2+deb11u1) ...Setting up syslog-ng-mod-mongodb (3.28.1-2+deb11u1) ...Setting up syslog-ng-mod-http (3.28.1-2+deb11u1) ...Setting up syslog-ng-mod-slog (3.28.1-2+deb11u1) ...Setting up syslog-ng-mod-map-value-pairs (3.28.1-2+deb11u1) ...Setting up syslog-ng-mod-sql (3.28.1-2+deb11u1) ...Setting up syslog-ng-mod-redis (3.28.1-2+deb11u1) ...Setting up syslog-ng (3.28.1-2+deb11u1) ...Processing triggers for man-db (2.9.4-2) ...Processing triggers for libc-bin (2.31-13+rpt2+rpi1+deb11u8) ...Scanning processes...                                                                                         Scanning processor microcode...                                                                               Scanning linux images...                                                                                      Running kernel seems to be up-to-date.Failed to check for processor microcode upgrades.No services need to be restarted.No containers need to be restarted.No user sessions are running outdated binaries.2. With syslog-ng installed, it’s now time to build the configuration for it. 
A new configuration file fromnet.conf is shown below. It defines a network source and a syslog-ng destination that aggregates logs from the Turing Pi nodes in /var/log/fromnet in plain text format. Additionally, the logs will be written in JSON format to the file /var/log/fromnet.json.\n\nroot@turingpi:~# cat /etc/syslog-ng/fromnet.conf\n# source\nsource s_fromnet {\n  syslog(port(601));\n};\n# destination\ndestination d_fromnet {\n  file(\"/var/log/fromnet\");\n  file(\"/var/log/fromnet.json\" template(\"$(format-json --scope rfc5424 --scope dot-nv-pairs        --rekey .* --shift 1 --scope nv-pairs)\\n\") );\n};\n# log path\nlog {\n  source(s_fromnet);\n  destination(d_fromnet);\n};\n\nUnless we only want to see source IP addresses in the collected logs, it’s necessary to update the syslog-ng configuration file /etc/syslog-ng/syslog-ng.conf to record the hostnames from which the log messages have originated. This is done by adding the keep_hostname(yes) parameter to the options section as follows:\n\n........\n# First, set some global options.\noptions { chain_hostnames(off); flush_lines(0); use_dns(no); use_fqdn(no);\n          keep_hostname(yes); dns_cache(no); owner(\"root\"); group(\"adm\"); perm(0640);\n          stats_freq(0); bad_hostname(\"^gconfd$\");\n};\n........\n\nNext, the IBM LSF configuration is updated to prevent the creation of local logfiles for the LSF daemons. This is done by commenting out the LSF_LOGDIR option in the configuration file $LSF_ENVDIR/lsf.conf. At the same time, we also set LSF_LOG_MASK=LOG_DEBUG for testing purposes to enable verbose logging for the LSF daemons.\n\n........\n# Daemon log messages\n# LSF_LOGDIR=/opt/ibm/lsf/log\nLSF_LOG_MASK=LOG_DEBUG\n........\n\nFinally, to make the changes take effect, both syslog-ng and LSF are restarted.\n\nroot@turingpi:~# systemctl restart syslog-ng\nroot@turingpi:~# . 
/opt/ibm/lsf/conf/profile.lsf\nroot@turingpi:~# lsf_daemons restart\nStopping the LSF subsystem\nStarting the LSF subsystem\n\nWith the configuration ready on the centralized logging server, host turingpi, we now turn our attention to Nodes 2-7 in the cluster. Here we’ll use the parallel-ssh tool to streamline some operations. We start with the installation of syslog-ng across Nodes 2-7. Note that the output of the installation of syslog-ng across the compute nodes has been truncated.  Truncated output of parallel-ssh -h /opt/workers -i \"apt-get install syslog-ng -y\".\n\nroot@turingpi:~# parallel-ssh -h /opt/workers -i \"apt-get install syslog-ng -y\" [1] 13:57:07 [SUCCESS] kemenyReading package lists...Building dependency tree...Reading state information...The following additional packages will be installed:  libbson-1.0-0 libdbi1 libesmtp6 libhiredis0.14 libivykis0 libmaxminddb0  libmongoc-1.0-0 libmongocrypt0 libnet1 libprotobuf-c1 librabbitmq4  librdkafka1 libriemann-client0 libsensors-config libsensors5 libsnappy1v5  libsnmp-base libsnmp40 syslog-ng-core syslog-ng-mod-add-contextual-data  syslog-ng-mod-amqp syslog-ng-mod-examples syslog-ng-mod-extra  syslog-ng-mod-geoip2 syslog-ng-mod-getent syslog-ng-mod-graphite  syslog-ng-mod-http syslog-ng-mod-map-value-pairs syslog-ng-mod-mongodb  syslog-ng-mod-python syslog-ng-mod-rdkafka syslog-ng-mod-redis  syslog-ng-mod-riemann syslog-ng-mod-slog syslog-ng-mod-smtp  syslog-ng-mod-snmp syslog-ng-mod-sql syslog-ng-mod-stardate  syslog-ng-mod-stomp syslog-ng-mod-xml-parserSuggested packages:  mmdb-bin lm-sensors snmp-mibs-downloader rabbitmq-server graphite-web  mongodb-server libdbd-mysql libdbd-pgsql libdbd-sqlite3 activemqThe following packages will be REMOVED:  rsyslogThe following NEW packages will be installed:  libbson-1.0-0 libdbi1 libesmtp6 libhiredis0.14 libivykis0 libmaxminddb0  libmongoc-1.0-0 libmongocrypt0 libnet1 libprotobuf-c1 librabbitmq4  librdkafka1 libriemann-client0 
libsensors-config libsensors5 libsnappy1v5  libsnmp-base libsnmp40 syslog-ng syslog-ng-core  syslog-ng-mod-add-contextual-data syslog-ng-mod-amqp syslog-ng-mod-examples  syslog-ng-mod-extra syslog-ng-mod-geoip2 syslog-ng-mod-getent  syslog-ng-mod-graphite syslog-ng-mod-http syslog-ng-mod-map-value-pairs  syslog-ng-mod-mongodb syslog-ng-mod-python syslog-ng-mod-rdkafka  syslog-ng-mod-redis syslog-ng-mod-riemann syslog-ng-mod-slog  syslog-ng-mod-smtp syslog-ng-mod-snmp syslog-ng-mod-sql  syslog-ng-mod-stardate syslog-ng-mod-stomp syslog-ng-mod-xml-parser0 upgraded, 41 newly installed, 1 to remove and 0 not upgraded.Need to get 7,098 kB of archives.After this operation, 15.3 MB of additional disk space will be used.Get:1 http://deb.debian.org/debian bullseye/main arm64 libbson-1.0-0 arm64 1.17.6-1 [69.7 kB]Get:2 http://deb.debian.org/debian bullseye/main arm64 libmongocrypt0 arm64 1.1.0-1 [114 kB]Get:3 http://deb.debian.org/debian bullseye/main arm64 libsnappy1v5 arm64 1.1.8-1 [17.2 kB]Get:4 http://deb.debian.org/debian bullseye/main arm64 libmongoc-1.0-0 arm64 1.17.6-1 [257 kB]Get:5 http://deb.debian.org/debian bullseye/main arm64 libivykis0 arm64 0.42.4-1 [25.3 kB]Get:6 http://deb.debian.org/debian bullseye/main arm64 libnet1 arm64 1.1.6+dfsg-3.1 [56.8 kB]Get:7 http://deb.debian.org/debian bullseye/main arm64 syslog-ng-core arm64 3.28.1-2+deb11u1 [591 kB]Get:8 http://deb.debian.org/debian bullseye/main arm64 syslog-ng-mod-mongodb arm64 3.28.1-2+deb11u1 [37.9 kB]Get:9 http://deb.debian.org/debian bullseye/main arm64 libdbi1 arm64 0.9.0-6 [27.8 kB]Get:10 http://deb.debian.org/debian bullseye/main arm64 syslog-ng-mod-sql arm64 3.28.1-2+deb11u1 [41.5 kB]Get:11 http://deb.debian.org/debian bullseye/main arm64 libesmtp6 arm64 1.0.6-4.3 [52.0 kB]Get:12 http://deb.debian.org/debian bullseye/main arm64 libhiredis0.14 arm64 0.14.1-1 [33.7 kB]Get:13 http://deb.debian.org/debian bullseye/main arm64 libmaxminddb0 arm64 1.5.2-1 [29.6 kB]Get:14 http://deb.debian.org/debian 
bullseye/main arm64 libprotobuf-c1 arm64 1.3.3-1+b2 [26.8 kB]Get:15 http://deb.debian.org/debian bullseye/main arm64 librabbitmq4 arm64 0.10.0-1 [39.7 kB]Get:16 http://deb.debian.org/debian bullseye/main arm64 librdkafka1 arm64 1.6.0-1 [515 kB]Get:17 http://deb.debian.org/debian bullseye/main arm64 libriemann-client0 arm64 1.10.4-2+b2 [21.9 kB]Get:18 http://deb.debian.org/debian bullseye/main arm64 libsensors-config all 1:3.6.0-7 [32.3 kB]Get:19 http://deb.debian.org/debian bullseye/main arm64 libsensors5 arm64 1:3.6.0-7 [51.2 kB]Get:20 http://deb.debian.org/debian bullseye/main arm64 libsnmp-base all 5.9+dfsg-4+deb11u1 [1,736 kB]Get:21 http://deb.debian.org/debian bullseye/main arm64 libsnmp40 arm64 5.9+dfsg-4+deb11u1 [2,497 kB]Get:22 http://deb.debian.org/debian bullseye/main arm64 syslog-ng all 3.28.1-2+deb11u1 [25.9 kB]Get:23 http://deb.debian.org/debian bullseye/main arm64 syslog-ng-mod-add-contextual-data arm64 3.28.1-2+deb11u1 [40.5 kB]Get:24 http://deb.debian.org/debian bullseye/main arm64 syslog-ng-mod-amqp arm64 3.28.1-2+deb11u1 [48.8 kB]Get:25 http://deb.debian.org/debian bullseye/main arm64 syslog-ng-mod-examples arm64 3.28.1-2+deb11u1 [57.3 kB]Get:26 http://deb.debian.org/debian bullseye/main arm64 syslog-ng-mod-extra all 3.28.1-2+deb11u1 [35.7 kB]Get:27 http://deb.debian.org/debian bullseye/main arm64 syslog-ng-mod-geoip2 arm64 3.28.1-2+deb11u1 [36.9 kB]Get:28 http://deb.debian.org/debian bullseye/main arm64 syslog-ng-mod-graphite arm64 3.28.1-2+deb11u1 [29.4 kB]Get:29 http://deb.debian.org/debian bullseye/main arm64 syslog-ng-mod-http arm64 3.28.1-2+deb11u1 [50.5 kB]Get:30 http://deb.debian.org/debian bullseye/main arm64 syslog-ng-mod-python arm64 3.28.1-2+deb11u1 [69.9 kB]Get:31 http://deb.debian.org/debian bullseye/main arm64 syslog-ng-mod-rdkafka arm64 3.28.1-2+deb11u1 [41.5 kB]Get:32 http://deb.debian.org/debian bullseye/main arm64 syslog-ng-mod-redis arm64 3.28.1-2+deb11u1 [37.6 kB]Get:33 http://deb.debian.org/debian bullseye/main arm64 
syslog-ng-mod-riemann arm64 3.28.1-2+deb11u1 [40.1 kB]Get:34 http://deb.debian.org/debian bullseye/main arm64 syslog-ng-mod-slog arm64 3.28.1-2+deb11u1 [63.3 kB]Get:35 http://deb.debian.org/debian bullseye/main arm64 syslog-ng-mod-smtp arm64 3.28.1-2+deb11u1 [38.0 kB]Get:36 http://deb.debian.org/debian bullseye/main arm64 syslog-ng-mod-snmp arm64 3.28.1-2+deb11u1 [42.5 kB]Get:37 http://deb.debian.org/debian bullseye/main arm64 syslog-ng-mod-stomp arm64 3.28.1-2+deb11u1 [39.1 kB]Get:38 http://deb.debian.org/debian bullseye/main arm64 syslog-ng-mod-xml-parser arm64 3.28.1-2+deb11u1 [34.7 kB]Get:39 http://deb.debian.org/debian bullseye/main arm64 syslog-ng-mod-getent arm64 3.28.1-2+deb11u1 [29.5 kB]Get:40 http://deb.debian.org/debian bullseye/main arm64 syslog-ng-mod-map-value-pairs arm64 3.28.1-2+deb11u1 [34.0 kB]Get:41 http://deb.debian.org/debian bullseye/main arm64 syslog-ng-mod-stardate arm64 3.28.1-2+deb11u1 [28.6 kB]Fetched 7,098 kB in 2s (3,566 kB/s)(Reading database ... 37650 files and directories currently installed.)Removing rsyslog (8.2102.0-2+deb11u1) ...Selecting previously unselected package libbson-1.0-0.(Reading database ... 
37592 files and directories currently installed.)Preparing to unpack .../00-libbson-1.0-0_1.17.6-1_arm64.deb ...Unpacking libbson-1.0-0 (1.17.6-1) ...Selecting previously unselected package libmongocrypt0:arm64.Preparing to unpack .../01-libmongocrypt0_1.1.0-1_arm64.deb ...Unpacking libmongocrypt0:arm64 (1.1.0-1) ...Selecting previously unselected package libsnappy1v5:arm64.Preparing to unpack .../02-libsnappy1v5_1.1.8-1_arm64.deb ...Unpacking libsnappy1v5:arm64 (1.1.8-1) ...Selecting previously unselected package libmongoc-1.0-0.Preparing to unpack .../03-libmongoc-1.0-0_1.17.6-1_arm64.deb ...Unpacking libmongoc-1.0-0 (1.17.6-1) ...Selecting previously unselected package libivykis0:arm64.Preparing to unpack .../04-libivykis0_0.42.4-1_arm64.deb ...Unpacking libivykis0:arm64 (0.42.4-1) ...Selecting previously unselected package libnet1:arm64.Preparing to unpack .../05-libnet1_1.1.6+dfsg-3.1_arm64.deb ...Unpacking libnet1:arm64 (1.1.6+dfsg-3.1) ...Selecting previously unselected package syslog-ng-core.Preparing to unpack .../06-syslog-ng-core_3.28.1-2+deb11u1_arm64.deb ...Unpacking syslog-ng-core (3.28.1-2+deb11u1) ...Selecting previously unselected package syslog-ng-mod-mongodb.Preparing to unpack .../07-syslog-ng-mod-mongodb_3.28.1-2+deb11u1_arm64.deb ...Unpacking syslog-ng-mod-mongodb (3.28.1-2+deb11u1) ...Selecting previously unselected package libdbi1:arm64.Preparing to unpack .../08-libdbi1_0.9.0-6_arm64.deb ...Unpacking libdbi1:arm64 (0.9.0-6) ...Selecting previously unselected package syslog-ng-mod-sql.Preparing to unpack .../09-syslog-ng-mod-sql_3.28.1-2+deb11u1_arm64.deb ...Unpacking syslog-ng-mod-sql (3.28.1-2+deb11u1) ...Selecting previously unselected package libesmtp6.Preparing to unpack .../10-libesmtp6_1.0.6-4.3_arm64.deb ...Unpacking libesmtp6 (1.0.6-4.3) ...Selecting previously unselected package libhiredis0.14:arm64.Preparing to unpack .../11-libhiredis0.14_0.14.1-1_arm64.deb ...Unpacking libhiredis0.14:arm64 (0.14.1-1) ...Selecting previously 
unselected package libmaxminddb0:arm64.Preparing to unpack .../12-libmaxminddb0_1.5.2-1_arm64.deb ...Unpacking libmaxminddb0:arm64 (1.5.2-1) ...Selecting previously unselected package libprotobuf-c1:arm64.Preparing to unpack .../13-libprotobuf-c1_1.3.3-1+b2_arm64.deb ...Unpacking libprotobuf-c1:arm64 (1.3.3-1+b2) ...Selecting previously unselected package librabbitmq4:arm64.Preparing to unpack .../14-librabbitmq4_0.10.0-1_arm64.deb ...Unpacking librabbitmq4:arm64 (0.10.0-1) ...Selecting previously unselected package librdkafka1:arm64.Preparing to unpack .../15-librdkafka1_1.6.0-1_arm64.deb ...Unpacking librdkafka1:arm64 (1.6.0-1) ...Selecting previously unselected package libriemann-client0:arm64.Preparing to unpack .../16-libriemann-client0_1.10.4-2+b2_arm64.deb ...Unpacking libriemann-client0:arm64 (1.10.4-2+b2) ...Selecting previously unselected package libsensors-config.Preparing to unpack .../17-libsensors-config_1%3a3.6.0-7_all.deb ...Unpacking libsensors-config (1:3.6.0-7) ...Selecting previously unselected package libsensors5:arm64.Preparing to unpack .../18-libsensors5_1%3a3.6.0-7_arm64.deb ...Unpacking libsensors5:arm64 (1:3.6.0-7) ...Selecting previously unselected package libsnmp-base.Preparing to unpack .../19-libsnmp-base_5.9+dfsg-4+deb11u1_all.deb ...Unpacking libsnmp-base (5.9+dfsg-4+deb11u1) ...Selecting previously unselected package libsnmp40:arm64.Preparing to unpack .../20-libsnmp40_5.9+dfsg-4+deb11u1_arm64.deb ...Unpacking libsnmp40:arm64 (5.9+dfsg-4+deb11u1) ...Selecting previously unselected package syslog-ng.Preparing to unpack .../21-syslog-ng_3.28.1-2+deb11u1_all.deb ...Unpacking syslog-ng (3.28.1-2+deb11u1) ...Selecting previously unselected package syslog-ng-mod-add-contextual-data.Preparing to unpack .../22-syslog-ng-mod-add-contextual-data_3.28.1-2+deb11u1_arm64.deb ...Unpacking syslog-ng-mod-add-contextual-data (3.28.1-2+deb11u1) ...Selecting previously unselected package syslog-ng-mod-amqp.Preparing to unpack 
.../23-syslog-ng-mod-amqp_3.28.1-2+deb11u1_arm64.deb ...Unpacking syslog-ng-mod-amqp (3.28.1-2+deb11u1) ...Selecting previously unselected package syslog-ng-mod-examples.Preparing to unpack .../24-syslog-ng-mod-examples_3.28.1-2+deb11u1_arm64.deb ...Unpacking syslog-ng-mod-examples (3.28.1-2+deb11u1) ...Selecting previously unselected package syslog-ng-mod-extra.Preparing to unpack .../25-syslog-ng-mod-extra_3.28.1-2+deb11u1_all.deb ...Unpacking syslog-ng-mod-extra (3.28.1-2+deb11u1) ...Selecting previously unselected package syslog-ng-mod-geoip2.Preparing to unpack .../26-syslog-ng-mod-geoip2_3.28.1-2+deb11u1_arm64.deb ...Unpacking syslog-ng-mod-geoip2 (3.28.1-2+deb11u1) ...Selecting previously unselected package syslog-ng-mod-graphite.Preparing to unpack .../27-syslog-ng-mod-graphite_3.28.1-2+deb11u1_arm64.deb ...Unpacking syslog-ng-mod-graphite (3.28.1-2+deb11u1) ...Selecting previously unselected package syslog-ng-mod-http.Preparing to unpack .../28-syslog-ng-mod-http_3.28.1-2+deb11u1_arm64.deb ...Unpacking syslog-ng-mod-http (3.28.1-2+deb11u1) ...Selecting previously unselected package syslog-ng-mod-python.Preparing to unpack .../29-syslog-ng-mod-python_3.28.1-2+deb11u1_arm64.deb ...Unpacking syslog-ng-mod-python (3.28.1-2+deb11u1) ...Selecting previously unselected package syslog-ng-mod-rdkafka.Preparing to unpack .../30-syslog-ng-mod-rdkafka_3.28.1-2+deb11u1_arm64.deb ...Unpacking syslog-ng-mod-rdkafka (3.28.1-2+deb11u1) ...Selecting previously unselected package syslog-ng-mod-redis.Preparing to unpack .../31-syslog-ng-mod-redis_3.28.1-2+deb11u1_arm64.deb ...Unpacking syslog-ng-mod-redis (3.28.1-2+deb11u1) ...Selecting previously unselected package syslog-ng-mod-riemann.Preparing to unpack .../32-syslog-ng-mod-riemann_3.28.1-2+deb11u1_arm64.deb ...Unpacking syslog-ng-mod-riemann (3.28.1-2+deb11u1) ...Selecting previously unselected package syslog-ng-mod-slog.Preparing to unpack .../33-syslog-ng-mod-slog_3.28.1-2+deb11u1_arm64.deb ...Unpacking 
syslog-ng-mod-slog (3.28.1-2+deb11u1) ...Selecting previously unselected package syslog-ng-mod-smtp.Preparing to unpack .../34-syslog-ng-mod-smtp_3.28.1-2+deb11u1_arm64.deb ...Unpacking syslog-ng-mod-smtp (3.28.1-2+deb11u1) ...Selecting previously unselected package syslog-ng-mod-snmp.Preparing to unpack .../35-syslog-ng-mod-snmp_3.28.1-2+deb11u1_arm64.deb ...Unpacking syslog-ng-mod-snmp (3.28.1-2+deb11u1) ...Selecting previously unselected package syslog-ng-mod-stomp.Preparing to unpack .../36-syslog-ng-mod-stomp_3.28.1-2+deb11u1_arm64.deb ...Unpacking syslog-ng-mod-stomp (3.28.1-2+deb11u1) ...Selecting previously unselected package syslog-ng-mod-xml-parser.Preparing to unpack .../37-syslog-ng-mod-xml-parser_3.28.1-2+deb11u1_arm64.deb ...Unpacking syslog-ng-mod-xml-parser (3.28.1-2+deb11u1) ...Selecting previously unselected package syslog-ng-mod-getent.Preparing to unpack .../38-syslog-ng-mod-getent_3.28.1-2+deb11u1_arm64.deb ...Unpacking syslog-ng-mod-getent (3.28.1-2+deb11u1) ...Selecting previously unselected package syslog-ng-mod-map-value-pairs.Preparing to unpack .../39-syslog-ng-mod-map-value-pairs_3.28.1-2+deb11u1_arm64.deb ...Unpacking syslog-ng-mod-map-value-pairs (3.28.1-2+deb11u1) ...Selecting previously unselected package syslog-ng-mod-stardate.Preparing to unpack .../40-syslog-ng-mod-stardate_3.28.1-2+deb11u1_arm64.deb ...Unpacking syslog-ng-mod-stardate (3.28.1-2+deb11u1) ...Setting up librabbitmq4:arm64 (0.10.0-1) ...Setting up libdbi1:arm64 (0.9.0-6) ...Setting up libsnmp-base (5.9+dfsg-4+deb11u1) ...Setting up libmaxminddb0:arm64 (1.5.2-1) ...Setting up libsensors-config (1:3.6.0-7) ...Setting up libesmtp6 (1.0.6-4.3) ...Setting up libnet1:arm64 (1.1.6+dfsg-3.1) ...Setting up libprotobuf-c1:arm64 (1.3.3-1+b2) ...Setting up libsnappy1v5:arm64 (1.1.8-1) ...Setting up libbson-1.0-0 (1.17.6-1) ...Setting up libivykis0:arm64 (0.42.4-1) ...Setting up libriemann-client0:arm64 (1.10.4-2+b2) ...Setting up libsensors5:arm64 (1:3.6.0-7) ...Setting up 
librdkafka1:arm64 (1.6.0-1) ...Setting up libhiredis0.14:arm64 (0.14.1-1) ...Setting up libmongocrypt0:arm64 (1.1.0-1) ...Setting up libsnmp40:arm64 (5.9+dfsg-4+deb11u1) ...Setting up libmongoc-1.0-0 (1.17.6-1) ...Setting up syslog-ng-core (3.28.1-2+deb11u1) ...Created symlink /etc/systemd/system/multi-user.target.wants/syslog-ng.service → /lib/systemd/system/syslog-ng.service.Setting up syslog-ng-mod-examples (3.28.1-2+deb11u1) ...Setting up syslog-ng-mod-xml-parser (3.28.1-2+deb11u1) ...Setting up syslog-ng-mod-stomp (3.28.1-2+deb11u1) ...Setting up syslog-ng-mod-riemann (3.28.1-2+deb11u1) ...Setting up syslog-ng-mod-stardate (3.28.1-2+deb11u1) ...Setting up syslog-ng-mod-geoip2 (3.28.1-2+deb11u1) ...Setting up syslog-ng-mod-getent (3.28.1-2+deb11u1) ...Setting up syslog-ng-mod-amqp (3.28.1-2+deb11u1) ...Setting up syslog-ng-mod-python (3.28.1-2+deb11u1) ...Setting up syslog-ng-mod-smtp (3.28.1-2+deb11u1) ...Setting up syslog-ng-mod-snmp (3.28.1-2+deb11u1) ...Setting up syslog-ng-mod-extra (3.28.1-2+deb11u1) ...Setting up syslog-ng-mod-rdkafka (3.28.1-2+deb11u1) ...Setting up syslog-ng-mod-graphite (3.28.1-2+deb11u1) ...Setting up syslog-ng-mod-add-contextual-data (3.28.1-2+deb11u1) ...Setting up syslog-ng-mod-mongodb (3.28.1-2+deb11u1) ...Setting up syslog-ng-mod-http (3.28.1-2+deb11u1) ...Setting up syslog-ng-mod-slog (3.28.1-2+deb11u1) ...Setting up syslog-ng-mod-map-value-pairs (3.28.1-2+deb11u1) ...Setting up syslog-ng-mod-sql (3.28.1-2+deb11u1) ...Setting up syslog-ng-mod-redis (3.28.1-2+deb11u1) ...Setting up syslog-ng (3.28.1-2+deb11u1) ...Processing triggers for man-db (2.9.4-2) ...Processing triggers for libc-bin (2.31-13+rpt2+rpi1+deb11u8) ...Stderr: debconf: unable to initialize frontend: Dialogdebconf: (TERM is not set, so the dialog frontend is not usable.)debconf: falling back to frontend: Readlinedebconf: unable to initialize frontend: Readlinedebconf: (This frontend requires a controlling tty.)debconf: falling back to frontend: 
Teletypedpkg-preconfigure: unable to re-open stdin: ........ 7. Following the installation of syslog-ng across Nodes 2-7, we verify that the installation was successful by checking the syslog-ng service status.  Output of parallel-ssh -h /opt/workers -i “systemctl status syslog-ng”. Click to expand  root@turingpi:~# parallel-ssh -h /opt/workers -i \"systemctl status syslog-ng\" [1] 14:03:46 [SUCCESS] kemeny● syslog-ng.service - System Logger Daemon     Loaded: loaded (/lib/systemd/system/syslog-ng.service; enabled; vendor preset: enabled)     Active: active (running) since Thu 2024-03-28 13:57:01 EDT; 6min ago       Docs: man:syslog-ng(8)   Main PID: 28694 (syslog-ng)      Tasks: 2 (limit: 779)        CPU: 40.228s     CGroup: /system.slice/syslog-ng.service             └─28694 /usr/sbin/syslog-ng -FMar 28 13:57:00 kemeny systemd[1]: Starting System Logger Daemon...Mar 28 13:57:01 kemeny syslog-ng[28694]: DIGEST-MD5 common mech freeMar 28 13:57:01 kemeny systemd[1]: Started System Logger Daemon.[2] 14:03:50 [SUCCESS] vonkarman● syslog-ng.service - System Logger Daemon     Loaded: loaded (/lib/systemd/system/syslog-ng.service; enabled; vendor preset: enabled)     Active: active (running) since Thu 2024-03-28 13:57:49 EDT; 5min ago       Docs: man:syslog-ng(8)   Main PID: 27486 (syslog-ng)      Tasks: 2 (limit: 779)        CPU: 2min 5.540s     CGroup: /system.slice/syslog-ng.service             └─27486 /usr/sbin/syslog-ng -FMar 28 13:57:44 vonkarman systemd[1]: Starting System Logger Daemon...Mar 28 13:57:46 vonkarman syslog-ng[27486]: DIGEST-MD5 common mech freeMar 28 13:57:49 vonkarman systemd[1]: Started System Logger Daemon.[3] 14:03:51 [SUCCESS] teller● syslog-ng.service - System Logger Daemon     Loaded: loaded (/lib/systemd/system/syslog-ng.service; enabled; vendor preset: enabled)     Active: active (running) since Thu 2024-03-28 13:57:39 EDT; 6min ago       Docs: man:syslog-ng(8)   Main PID: 24821 (syslog-ng)      Tasks: 2 (limit: 779)        CPU: 
2min 262ms     CGroup: /system.slice/syslog-ng.service             └─24821 /usr/sbin/syslog-ng -FMar 28 13:57:38 teller systemd[1]: Starting System Logger Daemon...Mar 28 13:57:38 teller syslog-ng[24821]: DIGEST-MD5 common mech freeMar 28 13:57:39 teller systemd[1]: Started System Logger Daemon.[4] 14:03:53 [SUCCESS] neumann● syslog-ng.service - System Logger Daemon     Loaded: loaded (/lib/systemd/system/syslog-ng.service; enabled; vendor preset: enabled)     Active: active (running) since Thu 2024-03-28 13:57:39 EDT; 6min ago       Docs: man:syslog-ng(8)   Main PID: 27734 (syslog-ng)      Tasks: 2 (limit: 779)        CPU: 1min 43.504s     CGroup: /system.slice/syslog-ng.service             └─27734 /usr/sbin/syslog-ng -FMar 28 13:57:38 neumann systemd[1]: Starting System Logger Daemon...Mar 28 13:57:38 neumann syslog-ng[27734]: DIGEST-MD5 common mech freeMar 28 13:57:39 neumann systemd[1]: Started System Logger Daemon.[5] 14:03:53 [SUCCESS] wigner● syslog-ng.service - System Logger Daemon     Loaded: loaded (/lib/systemd/system/syslog-ng.service; enabled; vendor preset: enabled)     Active: active (running) since Thu 2024-03-28 13:57:37 EDT; 6min ago       Docs: man:syslog-ng(8)   Main PID: 27512 (syslog-ng)      Tasks: 2 (limit: 779)        CPU: 1min 49.643s     CGroup: /system.slice/syslog-ng.service             └─27512 /usr/sbin/syslog-ng -FMar 28 13:57:36 wigner systemd[1]: Starting System Logger Daemon...Mar 28 13:57:36 wigner syslog-ng[27512]: DIGEST-MD5 common mech freeMar 28 13:57:37 wigner systemd[1]: Started System Logger Daemon.[6] 14:03:57 [SUCCESS] szilard● syslog-ng.service - System Logger Daemon     Loaded: loaded (/lib/systemd/system/syslog-ng.service; enabled; vendor preset: enabled)     Active: active (running) since Thu 2024-03-28 13:57:35 EDT; 6min ago       Docs: man:syslog-ng(8)   Main PID: 24136 (syslog-ng)      Tasks: 5 (limit: 779)        CPU: 2min 10.257s     CGroup: /system.slice/syslog-ng.service             └─24136 /usr/sbin/syslog-ng 
-FMar 28 13:57:34 szilard systemd[1]: Starting System Logger Daemon...Mar 28 13:57:34 szilard syslog-ng[24136]: DIGEST-MD5 common mech freeMar 28 13:57:35 szilard systemd[1]: Started System Logger Daemon. 8. Create the configuration file send.conf in /opt on host turingpi. Note that /opt is an NFS export on turingpi and is NFS-mounted by all of the compute nodes. This file sets the HOST field of outgoing log messages to the local hostname. This is done in the subsequent steps, where a sed operation replaces “placeholder” with the local hostname. Additionally, a data source s_hpc is defined, which scans /opt/ibm/lsf/log for LSF daemon log files. root@turingpi:/# cat /opt/send.conf rewrite r_host { set(\"placeholder\", value(\"HOST\")); };destination d_net {  syslog(\"turingpi\" port(601));};source s_hpc {  wildcard-file(      base-dir(\"/opt/ibm/lsf/log\")      filename-pattern(\"*.log.*\")      recursive(no)      follow-freq(1)  );};log {  source(s_src);  source(s_hpc);  rewrite(r_host);   destination(d_net);}; 9. On Nodes 2-7, copy the file /opt/send.conf to /etc/syslog-ng/conf.d/send.conf. root@turingpi:/# parallel-ssh -h /opt/workers -i \"cp /opt/send.conf /etc/syslog-ng/conf.d\" [1] 14:19:29 [SUCCESS] kemeny[2] 14:19:30 [SUCCESS] vonkarman[3] 14:19:30 [SUCCESS] wigner[4] 14:19:30 [SUCCESS] szilard[5] 14:19:30 [SUCCESS] teller[6] 14:19:31 [SUCCESS] neumann 10. Using sed, replace the “placeholder” string in /etc/syslog-ng/conf.d/send.conf with the local hostname, then double-check that the change was made correctly. root@turingpi:/# parallel-ssh -h /opt/workers -i 'HOST=`hostname`; sed -i \"s/placeholder/$HOST/g\" /etc/syslog-ng/conf.d/send.conf' [1] 14:38:09 [SUCCESS] kemeny[2] 14:38:09 [SUCCESS] teller[3] 14:38:09 [SUCCESS] vonkarman[4] 14:38:09 [SUCCESS] wigner[5] 14:38:09 [SUCCESS] neumann[6] 14:38:09 [SUCCESS] szilard  Output of parallel-ssh -h /opt/workers -i “cat /etc/syslog-ng/conf.d/send.conf”. 
Click to expand  root@turingpi:/# parallel-ssh -h /opt/workers -i \"cat /etc/syslog-ng/conf.d/send.conf\" [1] 14:38:33 [SUCCESS] kemenyrewrite r_host { set(\"kemeny\", value(\"HOST\")); };destination d_net {  syslog(\"turingpi\" port(601));};source s_hpc {  wildcard-file(      base-dir(\"/opt/ibm/lsf/log\")      filename-pattern(\"*.log.*\")      recursive(no)      follow-freq(1)  );};log {  source(s_sys);  source(s_hpc);  rewrite(r_host);   destination(d_net);};[2] 14:38:33 [SUCCESS] tellerrewrite r_host { set(\"teller\", value(\"HOST\")); };destination d_net {  syslog(\"turingpi\" port(601));};source s_hpc {  wildcard-file(      base-dir(\"/opt/ibm/lsf/log\")      filename-pattern(\"*.log.*\")      recursive(no)      follow-freq(1)  );};log {  source(s_sys);  source(s_hpc);  rewrite(r_host);   destination(d_net);};[3] 14:38:33 [SUCCESS] neumannrewrite r_host { set(\"neumann\", value(\"HOST\")); };destination d_net {  syslog(\"turingpi\" port(601));};source s_hpc {  wildcard-file(      base-dir(\"/opt/ibm/lsf/log\")      filename-pattern(\"*.log.*\")      recursive(no)      follow-freq(1)  );};log {  source(s_sys);  source(s_hpc);  rewrite(r_host);   destination(d_net);};[4] 14:38:33 [SUCCESS] szilardrewrite r_host { set(\"szilard\", value(\"HOST\")); };destination d_net {  syslog(\"turingpi\" port(601));};source s_hpc {  wildcard-file(      base-dir(\"/opt/ibm/lsf/log\")      filename-pattern(\"*.log.*\")      recursive(no)      follow-freq(1)  );};log {  source(s_sys);  source(s_hpc);  rewrite(r_host);   destination(d_net);};[5] 14:38:33 [SUCCESS] wignerrewrite r_host { set(\"wigner\", value(\"HOST\")); };destination d_net {  syslog(\"turingpi\" port(601));};source s_hpc {  wildcard-file(      base-dir(\"/opt/ibm/lsf/log\")      filename-pattern(\"*.log.*\")      recursive(no)      follow-freq(1)  );};log {  source(s_sys);  source(s_hpc);  rewrite(r_host);   destination(d_net);};[6] 14:38:33 [SUCCESS] vonkarmanrewrite r_host { set(\"vonkarman\", 
value(\"HOST\")); };destination d_net {  syslog(\"turingpi\" port(601));};source s_hpc {  wildcard-file(      base-dir(\"/opt/ibm/lsf/log\")      filename-pattern(\"*.log.*\")      recursive(no)      follow-freq(1)  );};log {  source(s_sys);  source(s_hpc);  rewrite(r_host);   destination(d_net);};11. Finally, syslog-ng is restarted on Nodes 2-7 and the status of the service is checked to ensure that there are no errors. root@turingpi:/opt# parallel-ssh -h /opt/workers -i \"systemctl restart syslog-ng\" [1] 14:49:03 [SUCCESS] kemeny[2] 14:49:05 [SUCCESS] szilard[3] 14:49:06 [SUCCESS] vonkarman[4] 14:49:06 [SUCCESS] neumann[5] 14:49:06 [SUCCESS] teller[6] 14:49:07 [SUCCESS] wigner  Output of parallel-ssh -h /opt/workers -i &ldquo;systemctl status syslog-ng&rdquo;. Click to expand  root@turingpi:/opt# parallel-ssh -h /opt/workers -i \"systemctl status syslog-ng\" [1] 14:49:31 [SUCCESS] kemeny● syslog-ng.service - System Logger Daemon     Loaded: loaded (/lib/systemd/system/syslog-ng.service; enabled; vendor preset: enabled)     Active: active (running) since Thu 2024-03-28 14:49:03 EDT; 28s ago       Docs: man:syslog-ng(8)   Main PID: 34982 (syslog-ng)      Tasks: 2 (limit: 779)        CPU: 398ms     CGroup: /system.slice/syslog-ng.service             └─34982 /usr/sbin/syslog-ng -FMar 28 14:49:02 kemeny systemd[1]: Starting System Logger Daemon...Mar 28 14:49:02 kemeny syslog-ng[34982]: DIGEST-MD5 common mech freeMar 28 14:49:03 kemeny systemd[1]: Started System Logger Daemon.[2] 14:49:33 [SUCCESS] vonkarman● syslog-ng.service - System Logger Daemon     Loaded: loaded (/lib/systemd/system/syslog-ng.service; enabled; vendor preset: enabled)     Active: active (running) since Thu 2024-03-28 14:49:06 EDT; 25s ago       Docs: man:syslog-ng(8)   Main PID: 33710 (syslog-ng)      Tasks: 2 (limit: 779)        CPU: 934ms     CGroup: /system.slice/syslog-ng.service             └─33710 /usr/sbin/syslog-ng -FMar 28 14:49:03 vonkarman systemd[1]: Starting System Logger 
Daemon...Mar 28 14:49:03 vonkarman syslog-ng[33710]: DIGEST-MD5 common mech freeMar 28 14:49:06 vonkarman systemd[1]: Started System Logger Daemon.[3] 14:49:33 [SUCCESS] neumann● syslog-ng.service - System Logger Daemon     Loaded: loaded (/lib/systemd/system/syslog-ng.service; enabled; vendor preset: enabled)     Active: active (running) since Thu 2024-03-28 14:49:06 EDT; 25s ago       Docs: man:syslog-ng(8)   Main PID: 34000 (syslog-ng)      Tasks: 2 (limit: 779)        CPU: 959ms     CGroup: /system.slice/syslog-ng.service             └─34000 /usr/sbin/syslog-ng -FMar 28 14:49:03 neumann systemd[1]: Starting System Logger Daemon...Mar 28 14:49:03 neumann syslog-ng[34000]: DIGEST-MD5 common mech freeMar 28 14:49:06 neumann systemd[1]: Started System Logger Daemon.[4] 14:49:33 [SUCCESS] wigner● syslog-ng.service - System Logger Daemon     Loaded: loaded (/lib/systemd/system/syslog-ng.service; enabled; vendor preset: enabled)     Active: active (running) since Thu 2024-03-28 14:49:07 EDT; 25s ago       Docs: man:syslog-ng(8)   Main PID: 33941 (syslog-ng)      Tasks: 2 (limit: 779)        CPU: 1.115s     CGroup: /system.slice/syslog-ng.service             └─33941 /usr/sbin/syslog-ng -FMar 28 14:49:03 wigner systemd[1]: Starting System Logger Daemon...Mar 28 14:49:04 wigner syslog-ng[33941]: DIGEST-MD5 common mech freeMar 28 14:49:07 wigner systemd[1]: Started System Logger Daemon.[5] 14:49:34 [SUCCESS] szilard● syslog-ng.service - System Logger Daemon     Loaded: loaded (/lib/systemd/system/syslog-ng.service; enabled; vendor preset: enabled)     Active: active (running) since Thu 2024-03-28 14:49:05 EDT; 26s ago       Docs: man:syslog-ng(8)   Main PID: 30348 (syslog-ng)      Tasks: 2 (limit: 779)        CPU: 816ms     CGroup: /system.slice/syslog-ng.service             └─30348 /usr/sbin/syslog-ng -FMar 28 14:49:03 szilard systemd[1]: Starting System Logger Daemon...Mar 28 14:49:03 szilard syslog-ng[30348]: DIGEST-MD5 common mech freeMar 28 14:49:05 szilard 
systemd[1]: Started System Logger Daemon.[6] 14:49:34 [SUCCESS] teller● syslog-ng.service - System Logger Daemon     Loaded: loaded (/lib/systemd/system/syslog-ng.service; enabled; vendor preset: enabled)     Active: active (running) since Thu 2024-03-28 14:49:06 EDT; 25s ago       Docs: man:syslog-ng(8)   Main PID: 31034 (syslog-ng)      Tasks: 2 (limit: 779)        CPU: 965ms     CGroup: /system.slice/syslog-ng.service             └─31034 /usr/sbin/syslog-ng -FDoes it work?The answer to this question is an emphatic YES!Let’s begin with a simple test running the logger command on all of the compute nodes, while monitoring /var/log/fromnet on host turingpi. root@turingpi:/home/lsfadmin# date; parallel-ssh -h /opt/workers -i 'HOST=`hostname`; logger This is a test from node $HOST. Do not panic!' Wed  3 Apr 21:41:45 EDT 2024 [1] 21:41:46 [SUCCESS] teller [2] 21:41:46 [SUCCESS] neumann [3] 21:41:46 [SUCCESS] wigner [4] 21:41:46 [SUCCESS] kemeny [5] 21:41:46 [SUCCESS] szilard [6] 21:41:46 [SUCCESS] vonkarmanroot@turingpi:/var/log# tail -f fromnet |grep panic Apr  3 21:41:46 szilard root[10918]: This is a test from node szilard. Do not panic! Apr  3 21:41:46 wigner root[11011]: This is a test from node wigner. Do not panic! Apr  3 21:41:46 neumann root[11121]: This is a test from node neumann. Do not panic! Apr  3 21:41:46 kemeny root[11029]: This is a test from node kemeny. Do not panic! Apr  3 21:41:46 teller root[10875]: This is a test from node teller. Do not panic! Apr  3 21:41:46 vonkarman root[10805]: This is a test from node vonkarman. Do not panic!Next, let’s look at whether the LSF logging is also captured. Here we simply restart the LSF daemons on Nodes 2-7 and monitor the /var/log/fromnet file. The full output can be viewed below.  Output of tail -f /var/log/fromnet. Click to expand  root@turingpi:/var/log# tail -f fromnet Apr  3 21:41:57 vonkarman systemd[10786]: systemd-exit.service: Succeeded. 
Apr  3 21:41:57 vonkarman systemd[10786]: Finished Exit the Session. Apr  3 21:41:57 vonkarman systemd[10786]: Reached target Exit the Session. Apr  3 21:41:57 vonkarman systemd[1]: user@0.service: Succeeded. Apr  3 21:41:57 vonkarman systemd[1]: Stopped User Manager for UID 0. Apr  3 21:41:57 vonkarman systemd[1]: Stopping User Runtime Directory /run/user/0... Apr  3 21:41:57 vonkarman systemd[1]: run-user-0.mount: Succeeded. Apr  3 21:41:57 vonkarman systemd[1]: user-runtime-dir@0.service: Succeeded. Apr  3 21:41:57 vonkarman systemd[1]: Stopped User Runtime Directory /run/user/0. Apr  3 21:41:57 vonkarman systemd[1]: Removed slice User Slice of UID 0. Apr  3 21:44:30 wigner dhcpcd[493]: eth0: Router Advertisement from fe80::da58:d7ff:fe00:6d83 Apr  3 21:44:57 szilard sshd[11234]: Accepted publickey for root from 192.168.1.172 port 52600 ssh2: ED25519 SHA256:xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx Apr  3 21:44:57 szilard sshd[11234]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0) Apr  3 21:44:58 szilard systemd[1]: Created slice User Slice of UID 0. Apr  3 21:44:58 szilard systemd[1]: Starting User Runtime Directory /run/user/0... Apr  3 21:44:58 szilard systemd-logind[382]: New session 30 of user root. Apr  3 21:44:58 szilard systemd[1]: Finished User Runtime Directory /run/user/0. Apr  3 21:44:58 szilard systemd[1]: Starting User Manager for UID 0... Apr  3 21:44:58 szilard systemd[11237]: pam_unix(systemd-user:session): session opened for user root(uid=0) by (uid=0) Apr  3 21:44:57 wigner sshd[11342]: Accepted publickey for root from 192.168.1.172 port 60388 ssh2: ED25519 SHA256:xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx Apr  3 21:44:57 wigner sshd[11342]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0) Apr  3 21:44:58 wigner systemd[1]: Created slice User Slice of UID 0. Apr  3 21:44:58 wigner systemd[1]: Starting User Runtime Directory /run/user/0... 
Apr  3 21:44:58 wigner systemd-logind[383]: New session 30 of user root. Apr  3 21:44:58 wigner systemd[1]: Finished User Runtime Directory /run/user/0. Apr  3 21:44:58 wigner systemd[1]: Starting User Manager for UID 0... Apr  3 21:44:58 wigner systemd[11345]: pam_unix(systemd-user:session): session opened for user root(uid=0) by (uid=0) Apr  3 21:44:57 neumann sshd[11436]: Accepted publickey for root from 192.168.1.172 port 55144 ssh2: ED25519 SHA256:xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx Apr  3 21:44:57 neumann sshd[11436]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0) Apr  3 21:44:57 neumann systemd[1]: Created slice User Slice of UID 0. Apr  3 21:44:57 neumann systemd[1]: Starting User Runtime Directory /run/user/0... Apr  3 21:44:58 neumann systemd-logind[398]: New session 30 of user root. Apr  3 21:44:58 neumann systemd[1]: Finished User Runtime Directory /run/user/0. Apr  3 21:44:58 neumann systemd[1]: Starting User Manager for UID 0... Apr  3 21:44:58 neumann systemd[11439]: pam_unix(systemd-user:session): session opened for user root(uid=0) by (uid=0) Apr  3 21:44:57 kemeny sshd[11345]: Accepted publickey for root from 192.168.1.172 port 59830 ssh2: ED25519 SHA256:xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx Apr  3 21:44:57 kemeny sshd[11345]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0) Apr  3 21:44:58 kemeny systemd[1]: Created slice User Slice of UID 0. Apr  3 21:44:58 kemeny systemd[1]: Starting User Runtime Directory /run/user/0... Apr  3 21:44:58 kemeny systemd-logind[386]: New session 30 of user root. Apr  3 21:44:58 kemeny systemd[1]: Finished User Runtime Directory /run/user/0. Apr  3 21:44:58 kemeny systemd[1]: Starting User Manager for UID 0... 
Apr  3 21:44:58 kemeny systemd[11348]: pam_unix(systemd-user:session): session opened for user root(uid=0) by (uid=0) Apr  3 21:44:57 teller sshd[11189]: Accepted publickey for root from 192.168.1.172 port 35310 ssh2: ED25519 SHA256:xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx Apr  3 21:44:57 teller sshd[11189]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0) Apr  3 21:44:58 teller systemd[1]: Created slice User Slice of UID 0. Apr  3 21:44:58 teller systemd[1]: Starting User Runtime Directory /run/user/0... Apr  3 21:44:58 teller systemd-logind[382]: New session 30 of user root. Apr  3 21:44:58 teller systemd[1]: Finished User Runtime Directory /run/user/0. Apr  3 21:44:58 teller systemd[1]: Starting User Manager for UID 0... Apr  3 21:44:58 teller systemd[11192]: pam_unix(systemd-user:session): session opened for user root(uid=0) by (uid=0) Apr  3 21:44:57 vonkarman sshd[11118]: Accepted publickey for root from 192.168.1.172 port 48654 ssh2: ED25519 SHA256:xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx Apr  3 21:44:58 vonkarman sshd[11118]: pam_unix(sshd:session): session opened for user root(uid=0) by (uid=0) Apr  3 21:44:58 vonkarman systemd[1]: Created slice User Slice of UID 0. Apr  3 21:44:58 vonkarman systemd[1]: Starting User Runtime Directory /run/user/0... Apr  3 21:44:58 vonkarman systemd-logind[382]: New session 29 of user root. Apr  3 21:44:58 vonkarman systemd[1]: Finished User Runtime Directory /run/user/0. Apr  3 21:44:58 vonkarman systemd[1]: Starting User Manager for UID 0... Apr  3 21:44:58 vonkarman systemd[11121]: pam_unix(systemd-user:session): session opened for user root(uid=0) by (uid=0) Apr  3 21:44:58 neumann systemd[11439]: Queued start job for default target Main User Target. Apr  3 21:44:58 neumann systemd[11439]: Created slice User Application Slice. Apr  3 21:44:58 neumann systemd[11439]: Reached target Paths. Apr  3 21:44:58 neumann systemd[11439]: Reached target Timers. 
Apr  3 21:44:58 neumann systemd[11439]: Listening on GnuPG network certificate management daemon. Apr  3 21:44:58 neumann systemd[11439]: Listening on GnuPG cryptographic agent and passphrase cache (access for web browsers). Apr  3 21:44:58 neumann systemd[11439]: Listening on GnuPG cryptographic agent and passphrase cache (restricted). Apr  3 21:44:58 neumann systemd[11439]: Listening on GnuPG cryptographic agent (ssh-agent emulation). Apr  3 21:44:58 neumann systemd[11439]: Listening on GnuPG cryptographic agent and passphrase cache. Apr  3 21:44:58 neumann systemd[11439]: Reached target Sockets. Apr  3 21:44:58 neumann systemd[11439]: Reached target Basic System. Apr  3 21:44:58 neumann systemd[11439]: Reached target Main User Target. Apr  3 21:44:58 neumann systemd[11439]: Startup finished in 379ms. Apr  3 21:44:58 neumann systemd[1]: Started User Manager for UID 0. Apr  3 21:44:58 neumann systemd[1]: Started Session 30 of user root. Apr  3 21:44:58 teller systemd[11192]: Queued start job for default target Main User Target. Apr  3 21:44:58 teller systemd[11192]: Created slice User Application Slice. Apr  3 21:44:58 teller systemd[11192]: Reached target Paths. Apr  3 21:44:58 teller systemd[11192]: Reached target Timers. Apr  3 21:44:58 teller systemd[11192]: Listening on GnuPG network certificate management daemon. Apr  3 21:44:58 teller systemd[11192]: Listening on GnuPG cryptographic agent and passphrase cache (access for web browsers). Apr  3 21:44:58 teller systemd[11192]: Listening on GnuPG cryptographic agent and passphrase cache (restricted). Apr  3 21:44:58 teller systemd[11192]: Listening on GnuPG cryptographic agent (ssh-agent emulation). Apr  3 21:44:58 teller systemd[11192]: Listening on GnuPG cryptographic agent and passphrase cache. Apr  3 21:44:58 teller systemd[11192]: Reached target Sockets. Apr  3 21:44:58 teller systemd[11192]: Reached target Basic System. Apr  3 21:44:58 teller systemd[11192]: Reached target Main User Target. 
Apr  3 21:44:58 teller systemd[11192]: Startup finished in 373ms. Apr  3 21:44:58 teller systemd[1]: Started User Manager for UID 0. Apr  3 21:44:58 teller systemd[1]: Started Session 30 of user root. Apr  3 21:44:58 vonkarman systemd[11121]: Queued start job for default target Main User Target. Apr  3 21:44:58 vonkarman systemd[11121]: Created slice User Application Slice. Apr  3 21:44:58 vonkarman systemd[11121]: Reached target Paths. Apr  3 21:44:58 vonkarman systemd[11121]: Reached target Timers. Apr  3 21:44:58 vonkarman systemd[11121]: Listening on GnuPG network certificate management daemon. Apr  3 21:44:58 vonkarman systemd[11121]: Listening on GnuPG cryptographic agent and passphrase cache (access for web browsers). Apr  3 21:44:58 vonkarman systemd[11121]: Listening on GnuPG cryptographic agent and passphrase cache (restricted). Apr  3 21:44:58 vonkarman systemd[11121]: Listening on GnuPG cryptographic agent (ssh-agent emulation). Apr  3 21:44:58 vonkarman systemd[11121]: Listening on GnuPG cryptographic agent and passphrase cache. Apr  3 21:44:58 vonkarman systemd[11121]: Reached target Sockets. Apr  3 21:44:58 vonkarman systemd[11121]: Reached target Basic System. Apr  3 21:44:58 vonkarman systemd[11121]: Reached target Main User Target. Apr  3 21:44:58 vonkarman systemd[11121]: Startup finished in 392ms. Apr  3 21:44:58 vonkarman systemd[1]: Started User Manager for UID 0. Apr  3 21:44:58 vonkarman systemd[1]: Started Session 29 of user root. Apr  3 21:44:58 szilard systemd[11237]: Queued start job for default target Main User Target. Apr  3 21:44:58 szilard systemd[11237]: Created slice User Application Slice. Apr  3 21:44:58 szilard systemd[11237]: Reached target Paths. Apr  3 21:44:58 szilard systemd[11237]: Reached target Timers. Apr  3 21:44:58 szilard systemd[11237]: Listening on GnuPG network certificate management daemon. 
Apr  3 21:44:58 szilard systemd[11237]: Listening on GnuPG cryptographic agent and passphrase cache (access for web browsers). Apr  3 21:44:58 szilard systemd[11237]: Listening on GnuPG cryptographic agent and passphrase cache (restricted). Apr  3 21:44:58 szilard systemd[11237]: Listening on GnuPG cryptographic agent (ssh-agent emulation). Apr  3 21:44:58 szilard systemd[11237]: Listening on GnuPG cryptographic agent and passphrase cache. Apr  3 21:44:58 szilard systemd[11237]: Reached target Sockets. Apr  3 21:44:58 szilard systemd[11237]: Reached target Basic System. Apr  3 21:44:58 szilard systemd[11237]: Reached target Main User Target. Apr  3 21:44:58 szilard systemd[11237]: Startup finished in 385ms. Apr  3 21:44:58 szilard systemd[1]: Started User Manager for UID 0. Apr  3 21:44:58 szilard systemd[1]: Started Session 30 of user root. Apr  3 21:44:58 wigner systemd[11345]: Queued start job for default target Main User Target. Apr  3 21:44:58 wigner systemd[11345]: Created slice User Application Slice. Apr  3 21:44:58 wigner systemd[11345]: Reached target Paths. Apr  3 21:44:58 wigner systemd[11345]: Reached target Timers. Apr  3 21:44:58 wigner systemd[11345]: Listening on GnuPG network certificate management daemon. Apr  3 21:44:58 wigner systemd[11345]: Listening on GnuPG cryptographic agent and passphrase cache (access for web browsers). Apr  3 21:44:58 wigner systemd[11345]: Listening on GnuPG cryptographic agent and passphrase cache (restricted). Apr  3 21:44:58 wigner systemd[11345]: Listening on GnuPG cryptographic agent (ssh-agent emulation). Apr  3 21:44:58 wigner systemd[11345]: Listening on GnuPG cryptographic agent and passphrase cache. Apr  3 21:44:58 wigner systemd[11345]: Reached target Sockets. Apr  3 21:44:58 wigner systemd[11345]: Reached target Basic System. Apr  3 21:44:58 wigner systemd[11345]: Reached target Main User Target. Apr  3 21:44:58 wigner systemd[11345]: Startup finished in 375ms. 
Apr  3 21:44:58 wigner systemd[1]: Started User Manager for UID 0. Apr  3 21:44:58 wigner systemd[1]: Started Session 30 of user root. Apr  3 21:44:58 kemeny systemd[11348]: Queued start job for default target Main User Target. Apr  3 21:44:58 kemeny systemd[11348]: Created slice User Application Slice. Apr  3 21:44:58 kemeny systemd[11348]: Reached target Paths. Apr  3 21:44:58 kemeny systemd[11348]: Reached target Timers. Apr  3 21:44:58 kemeny systemd[11348]: Listening on GnuPG network certificate management daemon. Apr  3 21:44:58 kemeny systemd[11348]: Listening on GnuPG cryptographic agent and passphrase cache (access for web browsers). Apr  3 21:44:58 kemeny systemd[11348]: Listening on GnuPG cryptographic agent and passphrase cache (restricted). Apr  3 21:44:58 kemeny systemd[11348]: Listening on GnuPG cryptographic agent (ssh-agent emulation). Apr  3 21:44:58 kemeny systemd[11348]: Listening on GnuPG cryptographic agent and passphrase cache. Apr  3 21:44:58 kemeny systemd[11348]: Reached target Sockets. Apr  3 21:44:58 kemeny systemd[11348]: Reached target Basic System. Apr  3 21:44:58 kemeny systemd[11348]: Reached target Main User Target. Apr  3 21:44:58 kemeny systemd[11348]: Startup finished in 400ms. Apr  3 21:44:58 kemeny systemd[1]: Started User Manager for UID 0. Apr  3 21:44:58 kemeny systemd[1]: Started Session 30 of user root. Apr  3 21:44:59 kemeny res[691]: term_handler: Received signal 15, exiting Apr  3 21:44:59 kemeny lim[688]: term_handler: Received signal 15, exiting Apr  3 21:44:59 kemeny sbatchd[693]: Daemon on host <kemeny> received signal <15>; exiting Apr  3 21:44:59 kemeny lsf_daemons[11434]: Stopping the LSF subsystem Apr  3 21:44:59 kemeny systemd[1]: lsfd.service: Succeeded. Apr  3 21:44:59 kemeny systemd[1]: lsfd.service: Consumed 11min 56.744s CPU time. 
Apr  3 21:44:59 szilard lim[685]: term_handler: Received signal 15, exiting Apr  3 21:44:59 szilard res[687]: term_handler: Received signal 15, exiting Apr  3 21:44:59 szilard sbatchd[689]: Daemon on host <szilard> received signal <15>; exiting Apr  3 21:44:59 vonkarman lim[686]: term_handler: Received signal 15, exiting Apr  3 21:44:59 vonkarman sbatchd[690]: Daemon on host <vonkarman> received signal <15>; exiting Apr  3 21:44:59 vonkarman res[688]: term_handler: Received signal 15, exiting Apr  3 21:44:59 teller lim[683]: term_handler: Received signal 15, exiting Apr  3 21:44:59 teller res[689]: term_handler: Received signal 15, exiting Apr  3 21:44:59 teller sbatchd[691]: Daemon on host <teller> received signal <15>; exiting Apr  3 21:44:59 teller lsf_daemons[11294]: Stopping the LSF subsystem Apr  3 21:44:59 wigner lim[719]: term_handler: Received signal 15, exiting Apr  3 21:44:59 wigner res[722]: term_handler: Received signal 15, exiting Apr  3 21:44:59 wigner sbatchd[724]: Daemon on host <wigner> received signal <15>; exiting Apr  3 21:44:59 wigner lsf_daemons[11438]: Stopping the LSF subsystem Apr  3 21:44:59 neumann res[713]: term_handler: Received signal 15, exiting Apr  3 21:44:59 neumann sbatchd[715]: Daemon on host <neumann> received signal <15>; exiting Apr  3 21:44:59 neumann lim[711]: term_handler: Received signal 15, exiting Apr  3 21:44:59 neumann lsf_daemons[11540]: Stopping the LSF subsystem Apr  3 21:44:59 neumann sshd[11436]: Received disconnect from 192.168.1.172 port 55144:11: disconnected by user Apr  3 21:44:59 neumann sshd[11436]: Disconnected from user root 192.168.1.172 port 55144 Apr  3 21:44:59 szilard lsf_daemons[11331]: Stopping the LSF subsystem Apr  3 21:44:59 szilard sshd[11234]: Received disconnect from 192.168.1.172 port 52600:11: disconnected by user Apr  3 21:44:59 szilard sshd[11234]: Disconnected from user root 192.168.1.172 port 52600 Apr  3 21:44:59 szilard 
sshd[11234]: pam_unix(sshd:session): session closed for user root Apr  3 21:44:59 szilard res[11357]: res/get_hostInfo: ls_gethostinfo() failed. Server host LIM configuration is not ready yet. Apr  3 21:44:59 szilard systemd-logind[382]: Session 30 logged out. Waiting for processes to exit. Apr  3 21:44:59 szilard res[11357]: cg_load_hierarchies: Please use the LSF package with higher glibc version to enable LSF cgroup v2 support. Apr  3 21:44:59 szilard systemd[1]: lsfd.service: Succeeded. Apr  3 21:44:59 szilard systemd[1]: lsfd.service: Consumed 1h 17min 44.040s CPU time. Apr  3 21:44:59 neumann sshd[11436]: pam_unix(sshd:session): session closed for user root Apr  3 21:44:59 neumann systemd-logind[398]: Session 30 logged out. Waiting for processes to exit. Apr  3 21:44:59 neumann res[11559]: res/get_hostInfo: ls_gethostinfo() failed. Server host LIM configuration is not ready yet. Apr  3 21:44:59 neumann res[11559]: cg_load_hierarchies: Please use the LSF package with higher glibc version to enable LSF cgroup v2 support. Apr  3 21:44:59 neumann systemd[1]: lsfd.service: Succeeded. Apr  3 21:44:59 neumann systemd[1]: lsfd.service: Consumed 1h 17min 21.135s CPU time. Apr  3 21:44:59 teller sshd[11189]: Received disconnect from 192.168.1.172 port 35310:11: disconnected by user Apr  3 21:44:59 teller sshd[11189]: Disconnected from user root 192.168.1.172 port 35310 Apr  3 21:44:59 teller sshd[11189]: pam_unix(sshd:session): session closed for user root Apr  3 21:44:59 teller systemd-logind[382]: Session 30 logged out. Waiting for processes to exit. Apr  3 21:44:59 teller res[11307]: res/get_hostInfo: ls_gethostinfo() failed. Server host LIM configuration is not ready yet. Apr  3 21:44:59 teller res[11307]: cg_load_hierarchies: Please use the LSF package with higher glibc version to enable LSF cgroup v2 support. 
Apr  3 21:44:59 teller res[11307]: term_handler: Received signal 15, exiting Apr  3 21:44:59 teller lim[11305]: term_handler: Received signal 15, exiting Apr  3 21:44:59 teller systemd[1]: lsfd.service: Succeeded. Apr  3 21:44:59 teller systemd[1]: lsfd.service: Consumed 1h 17min 47.675s CPU time. Apr  3 21:44:59 teller sbatchd[11309]: cg_load_hierarchies: Please use the LSF package with higher glibc version to enable LSF cgroup v2 support. Apr  3 21:44:59 kemeny sshd[11345]: Received disconnect from 192.168.1.172 port 59830:11: disconnected by user Apr  3 21:44:59 kemeny sshd[11345]: Disconnected from user root 192.168.1.172 port 59830 Apr  3 21:44:59 kemeny sshd[11345]: pam_unix(sshd:session): session closed for user root Apr  3 21:44:59 kemeny systemd-logind[386]: Session 30 logged out. Waiting for processes to exit. Apr  3 21:44:59 kemeny res[11467]: res/get_hostInfo: ls_gethostinfo() failed. Server host LIM configuration is not ready yet. Apr  3 21:44:59 kemeny res[11467]: cg_load_hierarchies: Please use the LSF package with higher glibc version to enable LSF cgroup v2 support. Apr  3 21:44:59 vonkarman lsf_daemons[11215]: Stopping the LSF subsystem Apr  3 21:44:59 vonkarman sshd[11118]: Received disconnect from 192.168.1.172 port 48654:11: disconnected by user Apr  3 21:44:59 vonkarman sshd[11118]: Disconnected from user root 192.168.1.172 port 48654 Apr  3 21:44:59 vonkarman sshd[11118]: pam_unix(sshd:session): session closed for user root Apr  3 21:44:59 vonkarman systemd-logind[382]: Session 29 logged out. Waiting for processes to exit. Apr  3 21:44:59 vonkarman res[11241]: res/get_hostInfo: ls_gethostinfo() failed. Server host LIM configuration is not ready yet. Apr  3 21:44:59 vonkarman res[11241]: cg_load_hierarchies: Please use the LSF package with higher glibc version to enable LSF cgroup v2 support. Apr  3 21:44:59 vonkarman systemd[1]: lsfd.service: Succeeded. Apr  3 21:44:59 vonkarman systemd[1]: lsfd.service: Consumed 1h 17min 34.650s CPU time. 
Apr  3 21:44:59 wigner sshd[11342]: Received disconnect from 192.168.1.172 port 60388:11: disconnected by user Apr  3 21:44:59 wigner sshd[11342]: Disconnected from user root 192.168.1.172 port 60388 Apr  3 21:44:59 wigner sshd[11342]: pam_unix(sshd:session): session closed for user root Apr  3 21:44:59 wigner res[11464]: res/get_hostInfo: ls_gethostinfo() failed. Server host LIM configuration is not ready yet. Apr  3 21:44:59 wigner systemd-logind[383]: Session 30 logged out. Waiting for processes to exit. Apr  3 21:44:59 wigner res[11464]: cg_load_hierarchies: Please use the LSF package with higher glibc version to enable LSF cgroup v2 support. Apr  3 21:44:59 wigner systemd[1]: lsfd.service: Succeeded. Apr  3 21:44:59 wigner systemd[1]: lsfd.service: Consumed 1h 17min 44.610s CPU time. As expected, we observed that LSF log messages are written to the fromnet file. Importantly, each entry contains the hostname, so we can identify the origin of each message. Conclusion: What started out as a chat about logging grew into the idea for a blog post, and I am thankful to Peter for his collaboration. We’ve illustrated here how to set up centralized logging on a Turing Pi system with syslog-ng to collect system and LSF logs. Of course, collecting log messages centrally is just the start of a journey. It is an important step, as it makes debugging and troubleshooting significantly easier. You can store logs in databases for easier searching. And once you better understand which log messages are important, you can even parse them and generate alerts or dashboards from them. All of these help you make sure that your HPC system runs smoothly and with minimal downtime. For me this was a learning experience, and I’ll be looking at how I can implement centralized logging more broadly in my home network.",
            "content_html": "<p>Logs are one of those indispensable things in IT when things go wrong. Having worked in technical support for software products in a past life, I’ve likely looked at hundreds (or more) logs over the years, helping to identify issues. So, I really appreciate the importance of logs, but I can honestly say that I never really thought about a logging strategy for the systems on my home network - primarily those running Linux.</p><p>One of my longtime friends, <a href=\"https://peter.czanik.hu/\">Peter Czanik</a>, who also works in IT, happens to be a logging guru as well as an IBM Champion for Power Systems (yeah!). So it’s only natural that we get to talking about logging. He is often complaining that even at IT security conferences people are unaware of the importance of central logging. So, why is it so important? For security it’s obvious: logs are stored independently from the compromised system, so they cannot be modified or deleted by the attacker. But central logging is beneficial for the HPC operator as well. First of all, it’s availability. You can read the logs even if one of your nodes becomes unreachable. Instead of trying to breath life into the failed node, you can just take a look at the logs and see a broken hard drive, or a similar deadly problem. And it is also convenience, as all logs are available at a single location. Logging into each node on the 3 node cluster to check locally saved logs is inconvenient but doable. On a 10 node cluster it takes a long time. On a 100 node cluster a couple of working days. While, if your logs are collected to a central location, maybe a single grep command, or search in a Kibana or similar web interface.</p><p>Those who follow my blog will know that I’ve been tinkering with a Turing Pi V1 system lately. You can read my latest post <a href=\"https://www.gaborsamu.com/blog/turingpi_noctua/\">here</a>. For me, the Turing Pi has always been a cluster in a box. 
My Turing Pi is fully populated with 7 compute modules. I’ve designated Node 1 to be the NFS server and LSF manager for the cluster. LSF is a workload scheduler for high-performance computing (HPC) from IBM. Naturally I turned to Peter for his guidance on logging, and the result is this blog post. Peter recommended that I use <a href=\"https://www.syslog-ng.com/\">syslog-ng</a> for log aggregation and also helped me through some of my first steps with <em>syslog-ng</em>. The goal was to aggregate both the system (syslog) and LSF logs on Node 1. TL;DR: it was easy to get it all working. But I encourage you to read on to better understand the nuances and the configuration that was needed for both syslog-ng and LSF.</p><p><strong>The environment</strong></p><p>The following software has been deployed on the Turing Pi:</p><ul><li>Raspberry Pi OS (<em>2023-02-21-raspios-bullseye-arm64-lite.img</em>)</li><li>syslog-ng 3 – (3.28.1 as supplied with Raspberry Pi OS)</li><li>IBM LSF Standard Edition V10.1.0.13</li></ul><p>The Turing Pi system is configured as follows:</p><p>Node 1 (<em>turingpi</em>) is the manager node of this cluster in a box and has by far the most storage. 
Naturally we want to use that as the centralized logging server.</p><hr /><table><thead><tr><th><strong>Node</strong></th><th><strong>Hostname</strong></th><th><strong>Hardware</strong></th><th><strong>Notes</strong></th></tr></thead><tbody><tr><td>1</td><td>turingpi</td><td>CM3+</td><td>LSF manager, NFS server, 128GB SDcard</td></tr><tr><td>2</td><td>kemeny</td><td>CM3</td><td>4GB eMMC flash</td></tr><tr><td>3</td><td>neumann</td><td>CM3+</td><td>8GB SDcard</td></tr><tr><td>4</td><td>szilard</td><td>CM3+</td><td>8GB SDcard</td></tr><tr><td>5</td><td>teller</td><td>CM3+</td><td>8GB SDcard</td></tr><tr><td>6</td><td>vonkarman</td><td>CM3+</td><td>8GB SDcard</td></tr><tr><td>7</td><td>wigner</td><td>CM3+</td><td>8GB SDcard</td></tr></tbody></table><hr /><p><strong>Syslog-ng &amp; LSF setup</strong></p><ol><li>Raspberry Pi OS configures <em>rsyslog</em> out of the box. The first step is to install <em>syslog-ng</em> on Node 1 in the environment. Note that installing syslog-ng automatically removes <em>rsyslog</em> from the node, as the apt output shows.</li></ol><p><details>  <strong>Output of <em>apt update; apt-get install syslog-ng -y</em>. 
Click to expand</strong>  <div class=\"highlight\"><pre><code class=\"language-python\">root<span style=\"color: #a6e22e;\">@turingpi</span>:<span style=\"color: #f92672;\">~</span><span style=\"color: #75715e;\"># apt update; apt-get install syslog-ng -y </span>Hit:<span style=\"color: #ae81ff;\">1</span> http:<span style=\"color: #f92672;\">//</span>security<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian<span style=\"color: #f92672;\">-</span>security bullseye<span style=\"color: #f92672;\">-</span>security InReleaseHit:<span style=\"color: #ae81ff;\">2</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye InRelease                                                        Hit:<span style=\"color: #ae81ff;\">3</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">-</span>updates InRelease                                                Hit:<span style=\"color: #ae81ff;\">4</span> https:<span style=\"color: #f92672;\">//</span>repos<span style=\"color: #f92672;\">.</span>influxdata<span style=\"color: #f92672;\">.</span>com<span style=\"color: #f92672;\">/</span>debian stable InRelease                                                   Hit:<span style=\"color: #ae81ff;\">5</span> https:<span style=\"color: #f92672;\">//</span>repos<span style=\"color: #f92672;\">.</span>influxdata<span style=\"color: #f92672;\">.</span>com<span style=\"color: #f92672;\">/</span>debian bullseye InRelease                                                 Hit:<span style=\"color: #ae81ff;\">6</span> http:<span style=\"color: #f92672;\">//</span>archive<span style=\"color: 
#f92672;\">.</span>raspberrypi<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye InRelease                                  Hit:<span style=\"color: #ae81ff;\">7</span> https:<span style=\"color: #f92672;\">//</span>packagecloud<span style=\"color: #f92672;\">.</span>io<span style=\"color: #f92672;\">/</span>ookla<span style=\"color: #f92672;\">/</span>speedtest<span style=\"color: #f92672;\">-</span>cli<span style=\"color: #f92672;\">/</span>debian bullseye InRelease                     Reading package lists<span style=\"color: #f92672;\">...</span> DoneBuilding dependency tree<span style=\"color: #f92672;\">...</span> DoneReading state information<span style=\"color: #f92672;\">...</span> DoneAll packages are up to date<span style=\"color: #f92672;\">.</span>Reading package lists<span style=\"color: #f92672;\">...</span> DoneBuilding dependency tree<span style=\"color: #f92672;\">...</span> DoneReading state information<span style=\"color: #f92672;\">...</span> DoneThe following additional packages will be installed:  libbson<span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">0</span> libdbi1 libesmtp6 libhiredis0<span style=\"color: #ae81ff;\">.14</span> libivykis0 libmaxminddb0 libmongoc<span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">0</span> libmongocrypt0  libnet1 libprotobuf<span style=\"color: #f92672;\">-</span>c1 librabbitmq4 librdkafka1 libriemann<span style=\"color: #f92672;\">-</span>client0 libsnappy1v5 libsnmp<span style=\"color: #f92672;\">-</span>base libsnmp40  syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>core syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>add<span 
style=\"color: #f92672;\">-</span>contextual<span style=\"color: #f92672;\">-</span>data syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>amqp syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>examples  syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>extra syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>geoip2 syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>getent syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>graphite syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>http  syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>map<span style=\"color: #f92672;\">-</span>value<span style=\"color: #f92672;\">-</span>pairs syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>mongodb syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>python syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>rdkafka  syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>redis syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>riemann syslog<span 
style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>slog syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>smtp syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>snmp  syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>sql syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>stardate syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>stomp syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>xml<span style=\"color: #f92672;\">-</span>parserSuggested packages:  mmdb<span style=\"color: #f92672;\">-</span>bin snmp<span style=\"color: #f92672;\">-</span>mibs<span style=\"color: #f92672;\">-</span>downloader rabbitmq<span style=\"color: #f92672;\">-</span>server graphite<span style=\"color: #f92672;\">-</span>web mongodb<span style=\"color: #f92672;\">-</span>server libdbd<span style=\"color: #f92672;\">-</span>mysql libdbd<span style=\"color: #f92672;\">-</span>pgsql  libdbd<span style=\"color: #f92672;\">-</span>sqlite3 activemqThe following packages will be REMOVED:  rsyslogThe following NEW packages will be installed:  libbson<span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">0</span> libdbi1 libesmtp6 libhiredis0<span style=\"color: #ae81ff;\">.14</span> libivykis0 libmaxminddb0 libmongoc<span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: 
#ae81ff;\">0</span> libmongocrypt0  libnet1 libprotobuf<span style=\"color: #f92672;\">-</span>c1 librabbitmq4 librdkafka1 libriemann<span style=\"color: #f92672;\">-</span>client0 libsnappy1v5 libsnmp<span style=\"color: #f92672;\">-</span>base libsnmp40  syslog<span style=\"color: #f92672;\">-</span>ng syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>core syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>add<span style=\"color: #f92672;\">-</span>contextual<span style=\"color: #f92672;\">-</span>data syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>amqp syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>examples  syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>extra syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>geoip2 syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>getent syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>graphite syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>http  syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>map<span style=\"color: #f92672;\">-</span>value<span style=\"color: #f92672;\">-</span>pairs syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>mongodb syslog<span style=\"color: 
#f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>python syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>rdkafka  syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>redis syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>riemann syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>slog syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>smtp syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>snmp  syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>sql syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>stardate syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>stomp syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>xml<span style=\"color: #f92672;\">-</span>parser<span style=\"color: #ae81ff;\">0</span> upgraded, <span style=\"color: #ae81ff;\">39</span> newly installed, <span style=\"color: #ae81ff;\">1</span> to remove <span style=\"color: #f92672;\">and</span> <span style=\"color: #ae81ff;\">0</span> <span style=\"color: #f92672;\">not</span> upgraded<span style=\"color: #f92672;\">.</span>Need to get <span style=\"color: #ae81ff;\">7</span>,<span style=\"color: #ae81ff;\">015</span> kB of archives<span style=\"color: 
#f92672;\">.</span>After this operation, <span style=\"color: #ae81ff;\">15.1</span> MB of additional disk space will be used<span style=\"color: #f92672;\">.</span>Get:<span style=\"color: #ae81ff;\">1</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 libbson<span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">0</span> arm64 <span style=\"color: #ae81ff;\">1.17.6</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span> [<span style=\"color: #ae81ff;\">69.7</span> kB]Get:<span style=\"color: #ae81ff;\">2</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 libmongocrypt0 arm64 <span style=\"color: #ae81ff;\">1.1.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span> [<span style=\"color: #ae81ff;\">114</span> kB]Get:<span style=\"color: #ae81ff;\">3</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 libsnappy1v5 arm64 <span style=\"color: #ae81ff;\">1.1.8</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span> [<span style=\"color: #ae81ff;\">17.2</span> kB]Get:<span style=\"color: #ae81ff;\">4</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian 
bullseye<span style=\"color: #f92672;\">/</span>main arm64 libmongoc<span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">0</span> arm64 <span style=\"color: #ae81ff;\">1.17.6</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span> [<span style=\"color: #ae81ff;\">257</span> kB]Get:<span style=\"color: #ae81ff;\">5</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 libivykis0 arm64 <span style=\"color: #ae81ff;\">0.42.4</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span> [<span style=\"color: #ae81ff;\">25.3</span> kB]Get:<span style=\"color: #ae81ff;\">6</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 libnet1 arm64 <span style=\"color: #ae81ff;\">1.1.6</span><span style=\"color: #f92672;\">+</span>dfsg<span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">3.1</span> [<span style=\"color: #ae81ff;\">56.8</span> kB]Get:<span style=\"color: #ae81ff;\">7</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>core arm64 <span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1 [<span style=\"color: 
#ae81ff;\">591</span> kB]Get:<span style=\"color: #ae81ff;\">8</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>mongodb arm64 <span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1 [<span style=\"color: #ae81ff;\">37.9</span> kB]Get:<span style=\"color: #ae81ff;\">9</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 libdbi1 arm64 <span style=\"color: #ae81ff;\">0.9.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">6</span> [<span style=\"color: #ae81ff;\">27.8</span> kB]Get:<span style=\"color: #ae81ff;\">10</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>sql arm64 <span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1 [<span style=\"color: #ae81ff;\">41.5</span> kB]Get:<span style=\"color: #ae81ff;\">11</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span 
style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 libesmtp6 arm64 <span style=\"color: #ae81ff;\">1.0.6</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">4.3</span> [<span style=\"color: #ae81ff;\">52.0</span> kB]Get:<span style=\"color: #ae81ff;\">12</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 libhiredis0<span style=\"color: #ae81ff;\">.14</span> arm64 <span style=\"color: #ae81ff;\">0.14.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span> [<span style=\"color: #ae81ff;\">33.7</span> kB]Get:<span style=\"color: #ae81ff;\">13</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 libmaxminddb0 arm64 <span style=\"color: #ae81ff;\">1.5.2</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span> [<span style=\"color: #ae81ff;\">29.6</span> kB]Get:<span style=\"color: #ae81ff;\">14</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 libprotobuf<span style=\"color: #f92672;\">-</span>c1 arm64 <span style=\"color: #ae81ff;\">1.3.3</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span><span style=\"color: #f92672;\">+</span>b2 [<span style=\"color: #ae81ff;\">26.8</span> kB]Get:<span style=\"color: #ae81ff;\">15</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: 
#f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 librabbitmq4 arm64 <span style=\"color: #ae81ff;\">0.10.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span> [<span style=\"color: #ae81ff;\">39.7</span> kB]Get:<span style=\"color: #ae81ff;\">16</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 librdkafka1 arm64 <span style=\"color: #ae81ff;\">1.6.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span> [<span style=\"color: #ae81ff;\">515</span> kB]Get:<span style=\"color: #ae81ff;\">17</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 libriemann<span style=\"color: #f92672;\">-</span>client0 arm64 <span style=\"color: #ae81ff;\">1.10.4</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>b2 [<span style=\"color: #ae81ff;\">21.9</span> kB]Get:<span style=\"color: #ae81ff;\">18</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 libsnmp<span style=\"color: #f92672;\">-</span>base all <span style=\"color: #ae81ff;\">5.9</span><span style=\"color: #f92672;\">+</span>dfsg<span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">4</span><span style=\"color: #f92672;\">+</span>deb11u1 [<span style=\"color: 
#ae81ff;\">1</span>,<span style=\"color: #ae81ff;\">736</span> kB]Get:<span style=\"color: #ae81ff;\">19</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 libsnmp40 arm64 <span style=\"color: #ae81ff;\">5.9</span><span style=\"color: #f92672;\">+</span>dfsg<span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">4</span><span style=\"color: #f92672;\">+</span>deb11u1 [<span style=\"color: #ae81ff;\">2</span>,<span style=\"color: #ae81ff;\">497</span> kB]Get:<span style=\"color: #ae81ff;\">20</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 syslog<span style=\"color: #f92672;\">-</span>ng all <span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1 [<span style=\"color: #ae81ff;\">25.9</span> kB]Get:<span style=\"color: #ae81ff;\">21</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>add<span style=\"color: #f92672;\">-</span>contextual<span style=\"color: #f92672;\">-</span>data arm64 <span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1 [<span style=\"color: #ae81ff;\">40.5</span> kB]Get:<span 
style=\"color: #ae81ff;\">22</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>amqp arm64 <span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1 [<span style=\"color: #ae81ff;\">48.8</span> kB]Get:<span style=\"color: #ae81ff;\">23</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>examples arm64 <span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1 [<span style=\"color: #ae81ff;\">57.3</span> kB]Get:<span style=\"color: #ae81ff;\">24</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>extra all <span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1 [<span style=\"color: #ae81ff;\">35.7</span> kB]Get:<span style=\"color: #ae81ff;\">25</span> 
http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>geoip2 arm64 <span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1 [<span style=\"color: #ae81ff;\">36.9</span> kB]Get:<span style=\"color: #ae81ff;\">26</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>graphite arm64 <span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1 [<span style=\"color: #ae81ff;\">29.4</span> kB]Get:<span style=\"color: #ae81ff;\">27</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>http arm64 <span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1 [<span style=\"color: #ae81ff;\">50.5</span> kB]Get:<span style=\"color: #ae81ff;\">28</span> http:<span style=\"color: 
#f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>python arm64 <span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1 [<span style=\"color: #ae81ff;\">69.9</span> kB]Get:<span style=\"color: #ae81ff;\">29</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>rdkafka arm64 <span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1 [<span style=\"color: #ae81ff;\">41.5</span> kB]Get:<span style=\"color: #ae81ff;\">30</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>redis arm64 <span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1 [<span style=\"color: #ae81ff;\">37.6</span> kB]Get:<span style=\"color: #ae81ff;\">31</span> http:<span style=\"color: #f92672;\">//</span>deb<span 
style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>riemann arm64 <span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1 [<span style=\"color: #ae81ff;\">40.1</span> kB]Get:<span style=\"color: #ae81ff;\">32</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>slog arm64 <span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1 [<span style=\"color: #ae81ff;\">63.3</span> kB]Get:<span style=\"color: #ae81ff;\">33</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>smtp arm64 <span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1 [<span style=\"color: #ae81ff;\">38.0</span> kB]Get:<span style=\"color: #ae81ff;\">34</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: 
#f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>snmp arm64 <span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1 [<span style=\"color: #ae81ff;\">42.5</span> kB]Get:<span style=\"color: #ae81ff;\">35</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>stomp arm64 <span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1 [<span style=\"color: #ae81ff;\">39.1</span> kB]Get:<span style=\"color: #ae81ff;\">36</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>xml<span style=\"color: #f92672;\">-</span>parser arm64 <span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1 [<span style=\"color: #ae81ff;\">34.7</span> kB]Get:<span style=\"color: #ae81ff;\">37</span> http:<span style=\"color: #f92672;\">//</span>deb<span 
style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>getent arm64 <span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1 [<span style=\"color: #ae81ff;\">29.5</span> kB]Get:<span style=\"color: #ae81ff;\">38</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>map<span style=\"color: #f92672;\">-</span>value<span style=\"color: #f92672;\">-</span>pairs arm64 <span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1 [<span style=\"color: #ae81ff;\">34.0</span> kB]Get:<span style=\"color: #ae81ff;\">39</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>stardate arm64 <span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1 [<span style=\"color: #ae81ff;\">28.6</span> kB]Fetched <span style=\"color: 
#ae81ff;\">7</span>,<span style=\"color: #ae81ff;\">015</span> kB <span style=\"color: #f92672;\">in</span> <span style=\"color: #ae81ff;\">5</span>s (<span style=\"color: #ae81ff;\">1</span>,<span style=\"color: #ae81ff;\">311</span> kB<span style=\"color: #f92672;\">/</span>s)           Extracting templates <span style=\"color: #f92672;\">from</span> packages: <span style=\"color: #ae81ff;\">100</span><span style=\"color: #f92672;\">%</span>(Reading database <span style=\"color: #f92672;\">...</span> <span style=\"color: #ae81ff;\">90182</span> files <span style=\"color: #f92672;\">and</span> directories currently installed<span style=\"color: #f92672;\">.</span>)Removing rsyslog (<span style=\"color: #ae81ff;\">8.2102.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package libbson<span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">0.</span>(Reading database <span style=\"color: #f92672;\">...</span> <span style=\"color: #ae81ff;\">90124</span> files <span style=\"color: #f92672;\">and</span> directories currently installed<span style=\"color: #f92672;\">.</span>)Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">00</span><span style=\"color: #f92672;\">-</span>libbson<span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">0_1.17.6</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span>_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking libbson<span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: 
#ae81ff;\">0</span> (<span style=\"color: #ae81ff;\">1.17.6</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span>) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package libmongocrypt0:arm64<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">01</span><span style=\"color: #f92672;\">-</span>libmongocrypt0_1<span style=\"color: #ae81ff;\">.1.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span>_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking libmongocrypt0:arm64 (<span style=\"color: #ae81ff;\">1.1.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span>) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package libsnappy1v5:arm64<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">02</span><span style=\"color: #f92672;\">-</span>libsnappy1v5_1<span style=\"color: #ae81ff;\">.1.8</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span>_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking libsnappy1v5:arm64 (<span style=\"color: #ae81ff;\">1.1.8</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span>) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package libmongoc<span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">0.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">03</span><span style=\"color: #f92672;\">-</span>libmongoc<span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1.0</span><span 
style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">0_1.17.6</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span>_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking libmongoc<span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">0</span> (<span style=\"color: #ae81ff;\">1.17.6</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span>) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package libivykis0:arm64<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">04</span><span style=\"color: #f92672;\">-</span>libivykis0_0<span style=\"color: #ae81ff;\">.42.4</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span>_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking libivykis0:arm64 (<span style=\"color: #ae81ff;\">0.42.4</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span>) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package libnet1:arm64<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">05</span><span style=\"color: #f92672;\">-</span>libnet1_1<span style=\"color: #ae81ff;\">.1.6</span><span style=\"color: #f92672;\">+</span>dfsg<span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">3.1</span>_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking libnet1:arm64 (<span style=\"color: #ae81ff;\">1.1.6</span><span style=\"color: #f92672;\">+</span>dfsg<span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">3.1</span>) <span style=\"color: 
#f92672;\">...</span>Selecting previously unselected package syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>core<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">06</span><span style=\"color: #f92672;\">-</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>core_3<span style=\"color: #ae81ff;\">.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>core (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>mongodb<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">07</span><span style=\"color: #f92672;\">-</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>mongodb_3<span style=\"color: #ae81ff;\">.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>mongodb (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: 
#f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package libdbi1:arm64<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">08</span><span style=\"color: #f92672;\">-</span>libdbi1_0<span style=\"color: #ae81ff;\">.9.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">6</span>_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking libdbi1:arm64 (<span style=\"color: #ae81ff;\">0.9.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">6</span>) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>sql<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">09</span><span style=\"color: #f92672;\">-</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>sql_3<span style=\"color: #ae81ff;\">.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>sql (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package libesmtp6<span style=\"color: 
#f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">10</span><span style=\"color: #f92672;\">-</span>libesmtp6_1<span style=\"color: #ae81ff;\">.0.6</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">4.3</span>_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking libesmtp6 (<span style=\"color: #ae81ff;\">1.0.6</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">4.3</span>) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package libhiredis0<span style=\"color: #ae81ff;\">.14</span>:arm64<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">11</span><span style=\"color: #f92672;\">-</span>libhiredis0<span style=\"color: #ae81ff;\">.14_0.14.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span>_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking libhiredis0<span style=\"color: #ae81ff;\">.14</span>:arm64 (<span style=\"color: #ae81ff;\">0.14.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span>) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package libmaxminddb0:arm64<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">12</span><span style=\"color: #f92672;\">-</span>libmaxminddb0_1<span style=\"color: #ae81ff;\">.5.2</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span>_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking libmaxminddb0:arm64 (<span style=\"color: #ae81ff;\">1.5.2</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span>) <span style=\"color: 
#f92672;\">...</span>Selecting previously unselected package libprotobuf<span style=\"color: #f92672;\">-</span>c1:arm64<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">13</span><span style=\"color: #f92672;\">-</span>libprotobuf<span style=\"color: #f92672;\">-</span>c1_1<span style=\"color: #ae81ff;\">.3.3</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span><span style=\"color: #f92672;\">+</span>b2_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking libprotobuf<span style=\"color: #f92672;\">-</span>c1:arm64 (<span style=\"color: #ae81ff;\">1.3.3</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span><span style=\"color: #f92672;\">+</span>b2) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package librabbitmq4:arm64<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">14</span><span style=\"color: #f92672;\">-</span>librabbitmq4_0<span style=\"color: #ae81ff;\">.10.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span>_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking librabbitmq4:arm64 (<span style=\"color: #ae81ff;\">0.10.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span>) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package librdkafka1:arm64<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">15</span><span style=\"color: #f92672;\">-</span>librdkafka1_1<span style=\"color: #ae81ff;\">.6.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span>_arm64<span style=\"color: #f92672;\">.</span>deb 
<span style=\"color: #f92672;\">...</span>Unpacking librdkafka1:arm64 (<span style=\"color: #ae81ff;\">1.6.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span>) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package libriemann<span style=\"color: #f92672;\">-</span>client0:arm64<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">16</span><span style=\"color: #f92672;\">-</span>libriemann<span style=\"color: #f92672;\">-</span>client0_1<span style=\"color: #ae81ff;\">.10.4</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>b2_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking libriemann<span style=\"color: #f92672;\">-</span>client0:arm64 (<span style=\"color: #ae81ff;\">1.10.4</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>b2) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package libsnmp<span style=\"color: #f92672;\">-</span>base<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">17</span><span style=\"color: #f92672;\">-</span>libsnmp<span style=\"color: #f92672;\">-</span>base_5<span style=\"color: #ae81ff;\">.9</span><span style=\"color: #f92672;\">+</span>dfsg<span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">4</span><span style=\"color: #f92672;\">+</span>deb11u1_all<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking libsnmp<span style=\"color: #f92672;\">-</span>base (<span style=\"color: #ae81ff;\">5.9</span><span style=\"color: #f92672;\">+</span>dfsg<span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">4</span><span 
style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package libsnmp40:arm64<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">18</span><span style=\"color: #f92672;\">-</span>libsnmp40_5<span style=\"color: #ae81ff;\">.9</span><span style=\"color: #f92672;\">+</span>dfsg<span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">4</span><span style=\"color: #f92672;\">+</span>deb11u1_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking libsnmp40:arm64 (<span style=\"color: #ae81ff;\">5.9</span><span style=\"color: #f92672;\">+</span>dfsg<span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">4</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">19</span><span style=\"color: #f92672;\">-</span>syslog<span style=\"color: #f92672;\">-</span>ng_3<span style=\"color: #ae81ff;\">.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1_all<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking syslog<span style=\"color: #f92672;\">-</span>ng (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>add<span style=\"color: 
#f92672;\">-</span>contextual<span style=\"color: #f92672;\">-</span>data<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">20</span><span style=\"color: #f92672;\">-</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>add<span style=\"color: #f92672;\">-</span>contextual<span style=\"color: #f92672;\">-</span>data_3<span style=\"color: #ae81ff;\">.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>add<span style=\"color: #f92672;\">-</span>contextual<span style=\"color: #f92672;\">-</span>data (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>amqp<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">21</span><span style=\"color: #f92672;\">-</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>amqp_3<span style=\"color: #ae81ff;\">.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking syslog<span 
style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>amqp (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>examples<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">22</span><span style=\"color: #f92672;\">-</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>examples_3<span style=\"color: #ae81ff;\">.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>examples (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>extra<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">23</span><span style=\"color: #f92672;\">-</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>extra_3<span style=\"color: 
#ae81ff;\">.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1_all<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>extra (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>geoip2<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">24</span><span style=\"color: #f92672;\">-</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>geoip2_3<span style=\"color: #ae81ff;\">.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>geoip2 (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>graphite<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: 
#f92672;\">.../</span><span style=\"color: #ae81ff;\">25</span><span style=\"color: #f92672;\">-</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>graphite_3<span style=\"color: #ae81ff;\">.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>graphite (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>http<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">26</span><span style=\"color: #f92672;\">-</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>http_3<span style=\"color: #ae81ff;\">.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>http (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: 
#f92672;\">...</span>Selecting previously unselected package syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>python<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">27</span><span style=\"color: #f92672;\">-</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>python_3<span style=\"color: #ae81ff;\">.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>python (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>rdkafka<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">28</span><span style=\"color: #f92672;\">-</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>rdkafka_3<span style=\"color: #ae81ff;\">.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: 
#f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>rdkafka (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>redis<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">29</span><span style=\"color: #f92672;\">-</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>redis_3<span style=\"color: #ae81ff;\">.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>redis (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>riemann<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">30</span><span style=\"color: #f92672;\">-</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>riemann_3<span style=\"color: #ae81ff;\">.28.1</span><span style=\"color: 
#f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>riemann (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>slog<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">31</span><span style=\"color: #f92672;\">-</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>slog_3<span style=\"color: #ae81ff;\">.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>slog (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>smtp<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: 
#ae81ff;\">32</span><span style=\"color: #f92672;\">-</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>smtp_3<span style=\"color: #ae81ff;\">.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>smtp (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>snmp<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">33</span><span style=\"color: #f92672;\">-</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>snmp_3<span style=\"color: #ae81ff;\">.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>snmp (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package syslog<span 
style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>stomp<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">34</span><span style=\"color: #f92672;\">-</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>stomp_3<span style=\"color: #ae81ff;\">.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>stomp (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>xml<span style=\"color: #f92672;\">-</span>parser<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">35</span><span style=\"color: #f92672;\">-</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>xml<span style=\"color: #f92672;\">-</span>parser_3<span style=\"color: #ae81ff;\">.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking syslog<span style=\"color: #f92672;\">-</span>ng<span 
style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>xml<span style=\"color: #f92672;\">-</span>parser (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>getent<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">36</span><span style=\"color: #f92672;\">-</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>getent_3<span style=\"color: #ae81ff;\">.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>getent (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>map<span style=\"color: #f92672;\">-</span>value<span style=\"color: #f92672;\">-</span>pairs<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">37</span><span style=\"color: #f92672;\">-</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: 
#f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>map<span style=\"color: #f92672;\">-</span>value<span style=\"color: #f92672;\">-</span>pairs_3<span style=\"color: #ae81ff;\">.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>map<span style=\"color: #f92672;\">-</span>value<span style=\"color: #f92672;\">-</span>pairs (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>stardate<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">38</span><span style=\"color: #f92672;\">-</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>stardate_3<span style=\"color: #ae81ff;\">.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>stardate (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: 
#f92672;\">...</span>Setting up librabbitmq4:arm64 (<span style=\"color: #ae81ff;\">0.10.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span>) <span style=\"color: #f92672;\">...</span>Setting up libdbi1:arm64 (<span style=\"color: #ae81ff;\">0.9.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">6</span>) <span style=\"color: #f92672;\">...</span>Setting up libsnmp<span style=\"color: #f92672;\">-</span>base (<span style=\"color: #ae81ff;\">5.9</span><span style=\"color: #f92672;\">+</span>dfsg<span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">4</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Setting up libmaxminddb0:arm64 (<span style=\"color: #ae81ff;\">1.5.2</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span>) <span style=\"color: #f92672;\">...</span>Setting up libesmtp6 (<span style=\"color: #ae81ff;\">1.0.6</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">4.3</span>) <span style=\"color: #f92672;\">...</span>Setting up libnet1:arm64 (<span style=\"color: #ae81ff;\">1.1.6</span><span style=\"color: #f92672;\">+</span>dfsg<span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">3.1</span>) <span style=\"color: #f92672;\">...</span>Setting up libprotobuf<span style=\"color: #f92672;\">-</span>c1:arm64 (<span style=\"color: #ae81ff;\">1.3.3</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span><span style=\"color: #f92672;\">+</span>b2) <span style=\"color: #f92672;\">...</span>Setting up libsnappy1v5:arm64 (<span style=\"color: #ae81ff;\">1.1.8</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span>) <span style=\"color: #f92672;\">...</span>Setting up libsnmp40:arm64 (<span style=\"color: #ae81ff;\">5.9</span><span style=\"color: #f92672;\">+</span>dfsg<span style=\"color: 
#f92672;\">-</span><span style=\"color: #ae81ff;\">4</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Setting up libbson<span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">0</span> (<span style=\"color: #ae81ff;\">1.17.6</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span>) <span style=\"color: #f92672;\">...</span>Setting up libivykis0:arm64 (<span style=\"color: #ae81ff;\">0.42.4</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span>) <span style=\"color: #f92672;\">...</span>Setting up libriemann<span style=\"color: #f92672;\">-</span>client0:arm64 (<span style=\"color: #ae81ff;\">1.10.4</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>b2) <span style=\"color: #f92672;\">...</span>Setting up librdkafka1:arm64 (<span style=\"color: #ae81ff;\">1.6.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span>) <span style=\"color: #f92672;\">...</span>Setting up libhiredis0<span style=\"color: #ae81ff;\">.14</span>:arm64 (<span style=\"color: #ae81ff;\">0.14.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span>) <span style=\"color: #f92672;\">...</span>Setting up libmongocrypt0:arm64 (<span style=\"color: #ae81ff;\">1.1.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span>) <span style=\"color: #f92672;\">...</span>Setting up libmongoc<span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">0</span> (<span style=\"color: #ae81ff;\">1.17.6</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span>) <span style=\"color: #f92672;\">...</span>Setting up 
syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>core (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Created symlink <span style=\"color: #f92672;\">/</span>etc<span style=\"color: #f92672;\">/</span>systemd<span style=\"color: #f92672;\">/</span>system<span style=\"color: #f92672;\">/</span>multi<span style=\"color: #f92672;\">-</span>user<span style=\"color: #f92672;\">.</span>target<span style=\"color: #f92672;\">.</span>wants<span style=\"color: #f92672;\">/</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">.</span>service <span style=\"color: #960050; background-color: #1e0010;\">→</span> <span style=\"color: #f92672;\">/</span>lib<span style=\"color: #f92672;\">/</span>systemd<span style=\"color: #f92672;\">/</span>system<span style=\"color: #f92672;\">/</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">.</span>service<span style=\"color: #f92672;\">.</span>Setting up syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>examples (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Setting up syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>xml<span style=\"color: #f92672;\">-</span>parser (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Setting up syslog<span style=\"color: #f92672;\">-</span>ng<span 
style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>stomp (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Setting up syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>riemann (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Setting up syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>stardate (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Setting up syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>geoip2 (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Setting up syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>getent (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Setting up syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>amqp (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: 
#f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Setting up syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>python (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Setting up syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>smtp (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Setting up syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>snmp (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Setting up syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>extra (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Setting up syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>rdkafka (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Setting up syslog<span 
style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>graphite (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Setting up syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>add<span style=\"color: #f92672;\">-</span>contextual<span style=\"color: #f92672;\">-</span>data (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Setting up syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>mongodb (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Setting up syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>http (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Setting up syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>slog (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Setting up syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: 
#f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>map<span style=\"color: #f92672;\">-</span>value<span style=\"color: #f92672;\">-</span>pairs (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Setting up syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>sql (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Setting up syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>redis (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Setting up syslog<span style=\"color: #f92672;\">-</span>ng (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Processing triggers <span style=\"color: #66d9ef;\">for</span> man<span style=\"color: #f92672;\">-</span>db (<span style=\"color: #ae81ff;\">2.9.4</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span>) <span style=\"color: #f92672;\">...</span>Processing triggers <span style=\"color: #66d9ef;\">for</span> libc<span style=\"color: #f92672;\">-</span>bin (<span style=\"color: #ae81ff;\">2.31</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">13</span><span style=\"color: #f92672;\">+</span>rpt2<span style=\"color: #f92672;\">+</span>rpi1<span 
style=\"color: #f92672;\">+</span>deb11u8) <span style=\"color: #f92672;\">...</span>Scanning processes<span style=\"color: #f92672;\">...</span>                                                                                         Scanning processor microcode<span style=\"color: #f92672;\">...</span>                                                                               Scanning linux images<span style=\"color: #f92672;\">...</span>                                                                                      Running kernel seems to be up<span style=\"color: #f92672;\">-</span>to<span style=\"color: #f92672;\">-</span>date<span style=\"color: #f92672;\">.</span>Failed to check <span style=\"color: #66d9ef;\">for</span> processor microcode upgrades<span style=\"color: #f92672;\">.</span>No services need to be restarted<span style=\"color: #f92672;\">.</span>No containers need to be restarted<span style=\"color: #f92672;\">.</span>No user sessions are running outdated binaries<span style=\"color: #f92672;\">.</span></code></pre></div></details><br />2. With <em>syslog-ng</em> installed, it’s now time to build the configuration for it. The new configuration file <em>fromnet.conf</em>, shown below, defines a <em>syslog-ng</em> destination that collects logs from the Turing Pi nodes into <em>/var/log/fromnet</em> in plain text format. 
Additionally, the logs will be written in JSON format to the file <em>/var/log/fromnet.json</em>.</p><div class=\"highlight\"><pre><code class=\"language-plaintext\">root@turingpi:~# cat /etc/syslog-ng/fromnet.conf # sourcesource s_fromnet {  syslog(port(601));};# destination destination d_fromnet {  file(\"/var/log/fromnet\");  file(\"/var/log/fromnet.json\" template(\"$(format-json --scope rfc5424 --scope dot-nv-pairs        --rekey .* --shift 1 --scope nv-pairs)\\n\") );};# log pathlog {  source(s_fromnet);  destination(d_fromnet);}; </code></pre></div><ol start=\"3\"><li>Unless we only want to see source IP addresses in the collected logs, it’s necessary to update the <em>syslog-ng</em> configuration file <em>/etc/syslog-ng/syslog-ng.conf</em> to record the hostnames from which the log messages originated. This is done by adding the <em>keep_hostname(yes)</em> parameter to the options section as follows:</li></ol><div class=\"highlight\"><pre><code class=\"language-plaintext\">........# First, set some global options. options { chain_hostnames(off); flush_lines(0); use_dns(no); use_fqdn(no);                  keep_hostname(yes);dns_cache(no); owner(\"root\"); group(\"adm\"); perm(0640);         stats_freq(0); bad_hostname(\"^gconfd$\"); };........</code></pre></div><ol start=\"4\"><li>Next, the IBM LSF configuration is updated to prevent the creation of local log files for the LSF daemons. This is done by commenting out the <em>LSF_LOGDIR</em> option in the configuration file <em>$LSF_ENVDIR/lsf.conf</em>. 
At the same time, we set <em>LSF_LOG_MASK=LOG_DEBUG</em> to enable verbose logging for the LSF daemons during testing.</li></ol><div class=\"highlight\"><pre><code class=\"language-plaintext\">........# Daemon log messages# LSF_LOGDIR=/opt/ibm/lsf/logLSF_LOG_MASK=LOG_DEBUG........</code></pre></div><ol start=\"5\"><li>Finally, to make the changes take effect, both syslog-ng and LSF are restarted.</li></ol><div class=\"highlight\"><pre><code class=\"language-plaintext\">root@turingpi:~# systemctl restart syslog-ng root@turingpi:~# . /opt/ibm/lsf/conf/profile.lsf  root@turingpi:~# lsf_daemons restart Stopping the LSF subsystem Starting the LSF subsystem</code></pre></div><ol start=\"6\"><li>With the configuration ready on the centralized logging server, host <em>turingpi</em>, we now turn our attention to Nodes 2-7 in the cluster. Here we’ll use the <em>parallel-ssh</em> tool to streamline some operations. We start by installing <em>syslog-ng</em> across Nodes 2-7. Note that the installation output from the compute nodes has been truncated.</li></ol><p><details>  <strong>Truncated output of <em>parallel-ssh -h /opt/workers -i &ldquo;apt-get install syslog-ng -y&rdquo;</em>. 
Click to expand</strong>  <div class=\"highlight\"><pre><code class=\"language-python\">root<span style=\"color: #a6e22e;\">@turingpi</span>:<span style=\"color: #f92672;\">~</span><span style=\"color: #75715e;\"># parallel-ssh -h /opt/workers -i \"apt-get install syslog-ng -y\" </span>[<span style=\"color: #ae81ff;\">1</span>] <span style=\"color: #ae81ff;\">13</span>:<span style=\"color: #ae81ff;\">57</span>:<span style=\"color: #ae81ff;\">07</span> [SUCCESS] kemenyReading package lists<span style=\"color: #f92672;\">...</span>Building dependency tree<span style=\"color: #f92672;\">...</span>Reading state information<span style=\"color: #f92672;\">...</span>The following additional packages will be installed:  libbson<span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">0</span> libdbi1 libesmtp6 libhiredis0<span style=\"color: #ae81ff;\">.14</span> libivykis0 libmaxminddb0  libmongoc<span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">0</span> libmongocrypt0 libnet1 libprotobuf<span style=\"color: #f92672;\">-</span>c1 librabbitmq4  librdkafka1 libriemann<span style=\"color: #f92672;\">-</span>client0 libsensors<span style=\"color: #f92672;\">-</span>config libsensors5 libsnappy1v5  libsnmp<span style=\"color: #f92672;\">-</span>base libsnmp40 syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>core syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>add<span style=\"color: #f92672;\">-</span>contextual<span style=\"color: #f92672;\">-</span>data  syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>amqp syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: 
#f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>examples syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>extra  syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>geoip2 syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>getent syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>graphite  syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>http syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>map<span style=\"color: #f92672;\">-</span>value<span style=\"color: #f92672;\">-</span>pairs syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>mongodb  syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>python syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>rdkafka syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>redis  syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>riemann syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>slog syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>smtp  syslog<span style=\"color: 
#f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>snmp syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>sql syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>stardate  syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>stomp syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>xml<span style=\"color: #f92672;\">-</span>parserSuggested packages:  mmdb<span style=\"color: #f92672;\">-</span>bin lm<span style=\"color: #f92672;\">-</span>sensors snmp<span style=\"color: #f92672;\">-</span>mibs<span style=\"color: #f92672;\">-</span>downloader rabbitmq<span style=\"color: #f92672;\">-</span>server graphite<span style=\"color: #f92672;\">-</span>web  mongodb<span style=\"color: #f92672;\">-</span>server libdbd<span style=\"color: #f92672;\">-</span>mysql libdbd<span style=\"color: #f92672;\">-</span>pgsql libdbd<span style=\"color: #f92672;\">-</span>sqlite3 activemqThe following packages will be REMOVED:  rsyslogThe following NEW packages will be installed:  libbson<span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">0</span> libdbi1 libesmtp6 libhiredis0<span style=\"color: #ae81ff;\">.14</span> libivykis0 libmaxminddb0  libmongoc<span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">0</span> libmongocrypt0 libnet1 libprotobuf<span style=\"color: #f92672;\">-</span>c1 librabbitmq4  librdkafka1 libriemann<span style=\"color: #f92672;\">-</span>client0 libsensors<span style=\"color: 
#f92672;\">-</span>config libsensors5 libsnappy1v5  libsnmp<span style=\"color: #f92672;\">-</span>base libsnmp40 syslog<span style=\"color: #f92672;\">-</span>ng syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>core  syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>add<span style=\"color: #f92672;\">-</span>contextual<span style=\"color: #f92672;\">-</span>data syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>amqp syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>examples  syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>extra syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>geoip2 syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>getent  syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>graphite syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>http syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>map<span style=\"color: #f92672;\">-</span>value<span style=\"color: #f92672;\">-</span>pairs  syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>mongodb syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>python syslog<span style=\"color: 
#f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>rdkafka  syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>redis syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>riemann syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>slog  syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>smtp syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>snmp syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>sql  syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>stardate syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>stomp syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>xml<span style=\"color: #f92672;\">-</span>parser<span style=\"color: #ae81ff;\">0</span> upgraded, <span style=\"color: #ae81ff;\">41</span> newly installed, <span style=\"color: #ae81ff;\">1</span> to remove <span style=\"color: #f92672;\">and</span> <span style=\"color: #ae81ff;\">0</span> <span style=\"color: #f92672;\">not</span> upgraded<span style=\"color: #f92672;\">.</span>Need to get <span style=\"color: #ae81ff;\">7</span>,<span style=\"color: #ae81ff;\">098</span> kB of archives<span style=\"color: #f92672;\">.</span>After this operation, <span style=\"color: #ae81ff;\">15.3</span> MB of additional disk space will be used<span 
style=\"color: #f92672;\">.</span>Get:<span style=\"color: #ae81ff;\">1</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 libbson<span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">0</span> arm64 <span style=\"color: #ae81ff;\">1.17.6</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span> [<span style=\"color: #ae81ff;\">69.7</span> kB]Get:<span style=\"color: #ae81ff;\">2</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 libmongocrypt0 arm64 <span style=\"color: #ae81ff;\">1.1.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span> [<span style=\"color: #ae81ff;\">114</span> kB]Get:<span style=\"color: #ae81ff;\">3</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 libsnappy1v5 arm64 <span style=\"color: #ae81ff;\">1.1.8</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span> [<span style=\"color: #ae81ff;\">17.2</span> kB]Get:<span style=\"color: #ae81ff;\">4</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 libmongoc<span style=\"color: #f92672;\">-</span><span style=\"color: 
#ae81ff;\">1.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">0</span> arm64 <span style=\"color: #ae81ff;\">1.17.6</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span> [<span style=\"color: #ae81ff;\">257</span> kB]Get:<span style=\"color: #ae81ff;\">5</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 libivykis0 arm64 <span style=\"color: #ae81ff;\">0.42.4</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span> [<span style=\"color: #ae81ff;\">25.3</span> kB]Get:<span style=\"color: #ae81ff;\">6</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 libnet1 arm64 <span style=\"color: #ae81ff;\">1.1.6</span><span style=\"color: #f92672;\">+</span>dfsg<span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">3.1</span> [<span style=\"color: #ae81ff;\">56.8</span> kB]Get:<span style=\"color: #ae81ff;\">7</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>core arm64 <span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1 [<span style=\"color: #ae81ff;\">591</span> kB]Get:<span style=\"color: #ae81ff;\">8</span> http:<span style=\"color: #f92672;\">//</span>deb<span 
style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>mongodb arm64 <span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1 [<span style=\"color: #ae81ff;\">37.9</span> kB]Get:<span style=\"color: #ae81ff;\">9</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 libdbi1 arm64 <span style=\"color: #ae81ff;\">0.9.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">6</span> [<span style=\"color: #ae81ff;\">27.8</span> kB]Get:<span style=\"color: #ae81ff;\">10</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>sql arm64 <span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1 [<span style=\"color: #ae81ff;\">41.5</span> kB]Get:<span style=\"color: #ae81ff;\">11</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 libesmtp6 arm64 <span 
style=\"color: #ae81ff;\">1.0.6</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">4.3</span> [<span style=\"color: #ae81ff;\">52.0</span> kB]Get:<span style=\"color: #ae81ff;\">12</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 libhiredis0<span style=\"color: #ae81ff;\">.14</span> arm64 <span style=\"color: #ae81ff;\">0.14.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span> [<span style=\"color: #ae81ff;\">33.7</span> kB]Get:<span style=\"color: #ae81ff;\">13</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 libmaxminddb0 arm64 <span style=\"color: #ae81ff;\">1.5.2</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span> [<span style=\"color: #ae81ff;\">29.6</span> kB]Get:<span style=\"color: #ae81ff;\">14</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 libprotobuf<span style=\"color: #f92672;\">-</span>c1 arm64 <span style=\"color: #ae81ff;\">1.3.3</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span><span style=\"color: #f92672;\">+</span>b2 [<span style=\"color: #ae81ff;\">26.8</span> kB]Get:<span style=\"color: #ae81ff;\">15</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span 
style=\"color: #f92672;\">/</span>main arm64 librabbitmq4 arm64 <span style=\"color: #ae81ff;\">0.10.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span> [<span style=\"color: #ae81ff;\">39.7</span> kB]Get:<span style=\"color: #ae81ff;\">16</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 librdkafka1 arm64 <span style=\"color: #ae81ff;\">1.6.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span> [<span style=\"color: #ae81ff;\">515</span> kB]Get:<span style=\"color: #ae81ff;\">17</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 libriemann<span style=\"color: #f92672;\">-</span>client0 arm64 <span style=\"color: #ae81ff;\">1.10.4</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>b2 [<span style=\"color: #ae81ff;\">21.9</span> kB]Get:<span style=\"color: #ae81ff;\">18</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 libsensors<span style=\"color: #f92672;\">-</span>config all <span style=\"color: #ae81ff;\">1</span>:<span style=\"color: #ae81ff;\">3.6.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">7</span> [<span style=\"color: #ae81ff;\">32.3</span> kB]Get:<span style=\"color: #ae81ff;\">19</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span 
style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 libsensors5 arm64 <span style=\"color: #ae81ff;\">1</span>:<span style=\"color: #ae81ff;\">3.6.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">7</span> [<span style=\"color: #ae81ff;\">51.2</span> kB]Get:<span style=\"color: #ae81ff;\">20</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 libsnmp<span style=\"color: #f92672;\">-</span>base all <span style=\"color: #ae81ff;\">5.9</span><span style=\"color: #f92672;\">+</span>dfsg<span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">4</span><span style=\"color: #f92672;\">+</span>deb11u1 [<span style=\"color: #ae81ff;\">1</span>,<span style=\"color: #ae81ff;\">736</span> kB]Get:<span style=\"color: #ae81ff;\">21</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 libsnmp40 arm64 <span style=\"color: #ae81ff;\">5.9</span><span style=\"color: #f92672;\">+</span>dfsg<span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">4</span><span style=\"color: #f92672;\">+</span>deb11u1 [<span style=\"color: #ae81ff;\">2</span>,<span style=\"color: #ae81ff;\">497</span> kB]Get:<span style=\"color: #ae81ff;\">22</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 syslog<span style=\"color: #f92672;\">-</span>ng all <span style=\"color: 
#ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1 [<span style=\"color: #ae81ff;\">25.9</span> kB]Get:<span style=\"color: #ae81ff;\">23</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>add<span style=\"color: #f92672;\">-</span>contextual<span style=\"color: #f92672;\">-</span>data arm64 <span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1 [<span style=\"color: #ae81ff;\">40.5</span> kB]Get:<span style=\"color: #ae81ff;\">24</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>amqp arm64 <span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1 [<span style=\"color: #ae81ff;\">48.8</span> kB]Get:<span style=\"color: #ae81ff;\">25</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span 
style=\"color: #f92672;\">-</span>examples arm64 <span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1 [<span style=\"color: #ae81ff;\">57.3</span> kB]Get:<span style=\"color: #ae81ff;\">26</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>extra all <span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1 [<span style=\"color: #ae81ff;\">35.7</span> kB]Get:<span style=\"color: #ae81ff;\">27</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>geoip2 arm64 <span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1 [<span style=\"color: #ae81ff;\">36.9</span> kB]Get:<span style=\"color: #ae81ff;\">28</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: 
#f92672;\">-</span>graphite arm64 <span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1 [<span style=\"color: #ae81ff;\">29.4</span> kB]Get:<span style=\"color: #ae81ff;\">29</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>http arm64 <span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1 [<span style=\"color: #ae81ff;\">50.5</span> kB]Get:<span style=\"color: #ae81ff;\">30</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>python arm64 <span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1 [<span style=\"color: #ae81ff;\">69.9</span> kB]Get:<span style=\"color: #ae81ff;\">31</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>rdkafka arm64 
<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1 [<span style=\"color: #ae81ff;\">41.5</span> kB]Get:<span style=\"color: #ae81ff;\">32</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>redis arm64 <span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1 [<span style=\"color: #ae81ff;\">37.6</span> kB]Get:<span style=\"color: #ae81ff;\">33</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>riemann arm64 <span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1 [<span style=\"color: #ae81ff;\">40.1</span> kB]Get:<span style=\"color: #ae81ff;\">34</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>slog arm64 <span style=\"color: 
#ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1 [<span style=\"color: #ae81ff;\">63.3</span> kB]Get:<span style=\"color: #ae81ff;\">35</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>smtp arm64 <span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1 [<span style=\"color: #ae81ff;\">38.0</span> kB]Get:<span style=\"color: #ae81ff;\">36</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>snmp arm64 <span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1 [<span style=\"color: #ae81ff;\">42.5</span> kB]Get:<span style=\"color: #ae81ff;\">37</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>stomp arm64 <span style=\"color: #ae81ff;\">3.28.1</span><span 
style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1 [<span style=\"color: #ae81ff;\">39.1</span> kB]Get:<span style=\"color: #ae81ff;\">38</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>xml<span style=\"color: #f92672;\">-</span>parser arm64 <span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1 [<span style=\"color: #ae81ff;\">34.7</span> kB]Get:<span style=\"color: #ae81ff;\">39</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>getent arm64 <span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1 [<span style=\"color: #ae81ff;\">29.5</span> kB]Get:<span style=\"color: #ae81ff;\">40</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>map<span style=\"color: #f92672;\">-</span>value<span 
style=\"color: #f92672;\">-</span>pairs arm64 <span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1 [<span style=\"color: #ae81ff;\">34.0</span> kB]Get:<span style=\"color: #ae81ff;\">41</span> http:<span style=\"color: #f92672;\">//</span>deb<span style=\"color: #f92672;\">.</span>debian<span style=\"color: #f92672;\">.</span>org<span style=\"color: #f92672;\">/</span>debian bullseye<span style=\"color: #f92672;\">/</span>main arm64 syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>stardate arm64 <span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1 [<span style=\"color: #ae81ff;\">28.6</span> kB]Fetched <span style=\"color: #ae81ff;\">7</span>,<span style=\"color: #ae81ff;\">098</span> kB <span style=\"color: #f92672;\">in</span> <span style=\"color: #ae81ff;\">2</span>s (<span style=\"color: #ae81ff;\">3</span>,<span style=\"color: #ae81ff;\">566</span> kB<span style=\"color: #f92672;\">/</span>s)(Reading database <span style=\"color: #f92672;\">...</span> <span style=\"color: #ae81ff;\">37650</span> files <span style=\"color: #f92672;\">and</span> directories currently installed<span style=\"color: #f92672;\">.</span>)Removing rsyslog (<span style=\"color: #ae81ff;\">8.2102.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package libbson<span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">0.</span>(Reading database <span style=\"color: #f92672;\">...</span> <span style=\"color: 
#ae81ff;\">37592</span> files <span style=\"color: #f92672;\">and</span> directories currently installed<span style=\"color: #f92672;\">.</span>)Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">00</span><span style=\"color: #f92672;\">-</span>libbson<span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">0_1.17.6</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span>_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking libbson<span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">0</span> (<span style=\"color: #ae81ff;\">1.17.6</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span>) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package libmongocrypt0:arm64<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">01</span><span style=\"color: #f92672;\">-</span>libmongocrypt0_1<span style=\"color: #ae81ff;\">.1.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span>_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking libmongocrypt0:arm64 (<span style=\"color: #ae81ff;\">1.1.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span>) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package libsnappy1v5:arm64<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">02</span><span style=\"color: #f92672;\">-</span>libsnappy1v5_1<span style=\"color: #ae81ff;\">.1.8</span><span style=\"color: 
#f92672;\">-</span><span style=\"color: #ae81ff;\">1</span>_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking libsnappy1v5:arm64 (<span style=\"color: #ae81ff;\">1.1.8</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span>) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package libmongoc<span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">0.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">03</span><span style=\"color: #f92672;\">-</span>libmongoc<span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">0_1.17.6</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span>_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking libmongoc<span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">0</span> (<span style=\"color: #ae81ff;\">1.17.6</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span>) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package libivykis0:arm64<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">04</span><span style=\"color: #f92672;\">-</span>libivykis0_0<span style=\"color: #ae81ff;\">.42.4</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span>_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking libivykis0:arm64 (<span style=\"color: #ae81ff;\">0.42.4</span><span style=\"color: #f92672;\">-</span><span 
style=\"color: #ae81ff;\">1</span>) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package libnet1:arm64<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">05</span><span style=\"color: #f92672;\">-</span>libnet1_1<span style=\"color: #ae81ff;\">.1.6</span><span style=\"color: #f92672;\">+</span>dfsg<span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">3.1</span>_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking libnet1:arm64 (<span style=\"color: #ae81ff;\">1.1.6</span><span style=\"color: #f92672;\">+</span>dfsg<span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">3.1</span>) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>core<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">06</span><span style=\"color: #f92672;\">-</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>core_3<span style=\"color: #ae81ff;\">.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>core (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: 
#f92672;\">-</span>mongodb<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">07</span><span style=\"color: #f92672;\">-</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>mongodb_3<span style=\"color: #ae81ff;\">.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>mongodb (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package libdbi1:arm64<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">08</span><span style=\"color: #f92672;\">-</span>libdbi1_0<span style=\"color: #ae81ff;\">.9.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">6</span>_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking libdbi1:arm64 (<span style=\"color: #ae81ff;\">0.9.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">6</span>) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>sql<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">09</span><span style=\"color: 
#f92672;\">-</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>sql_3<span style=\"color: #ae81ff;\">.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>sql (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package libesmtp6<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">10</span><span style=\"color: #f92672;\">-</span>libesmtp6_1<span style=\"color: #ae81ff;\">.0.6</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">4.3</span>_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking libesmtp6 (<span style=\"color: #ae81ff;\">1.0.6</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">4.3</span>) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package libhiredis0<span style=\"color: #ae81ff;\">.14</span>:arm64<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">11</span><span style=\"color: #f92672;\">-</span>libhiredis0<span style=\"color: #ae81ff;\">.14_0.14.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span>_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking libhiredis0<span style=\"color: 
#ae81ff;\">.14</span>:arm64 (<span style=\"color: #ae81ff;\">0.14.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span>) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package libmaxminddb0:arm64<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">12</span><span style=\"color: #f92672;\">-</span>libmaxminddb0_1<span style=\"color: #ae81ff;\">.5.2</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span>_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking libmaxminddb0:arm64 (<span style=\"color: #ae81ff;\">1.5.2</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span>) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package libprotobuf<span style=\"color: #f92672;\">-</span>c1:arm64<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">13</span><span style=\"color: #f92672;\">-</span>libprotobuf<span style=\"color: #f92672;\">-</span>c1_1<span style=\"color: #ae81ff;\">.3.3</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span><span style=\"color: #f92672;\">+</span>b2_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking libprotobuf<span style=\"color: #f92672;\">-</span>c1:arm64 (<span style=\"color: #ae81ff;\">1.3.3</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span><span style=\"color: #f92672;\">+</span>b2) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package librabbitmq4:arm64<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">14</span><span style=\"color: 
#f92672;\">-</span>librabbitmq4_0<span style=\"color: #ae81ff;\">.10.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span>_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking librabbitmq4:arm64 (<span style=\"color: #ae81ff;\">0.10.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span>) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package librdkafka1:arm64<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">15</span><span style=\"color: #f92672;\">-</span>librdkafka1_1<span style=\"color: #ae81ff;\">.6.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span>_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking librdkafka1:arm64 (<span style=\"color: #ae81ff;\">1.6.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span>) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package libriemann<span style=\"color: #f92672;\">-</span>client0:arm64<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">16</span><span style=\"color: #f92672;\">-</span>libriemann<span style=\"color: #f92672;\">-</span>client0_1<span style=\"color: #ae81ff;\">.10.4</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>b2_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking libriemann<span style=\"color: #f92672;\">-</span>client0:arm64 (<span style=\"color: #ae81ff;\">1.10.4</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>b2) <span style=\"color: 
#f92672;\">...</span>Selecting previously unselected package libsensors<span style=\"color: #f92672;\">-</span>config<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">17</span><span style=\"color: #f92672;\">-</span>libsensors<span style=\"color: #f92672;\">-</span>config_1<span style=\"color: #f92672;\">%</span><span style=\"color: #ae81ff;\">3</span>a3<span style=\"color: #ae81ff;\">.6.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">7</span>_all<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking libsensors<span style=\"color: #f92672;\">-</span>config (<span style=\"color: #ae81ff;\">1</span>:<span style=\"color: #ae81ff;\">3.6.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">7</span>) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package libsensors5:arm64<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">18</span><span style=\"color: #f92672;\">-</span>libsensors5_1<span style=\"color: #f92672;\">%</span><span style=\"color: #ae81ff;\">3</span>a3<span style=\"color: #ae81ff;\">.6.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">7</span>_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking libsensors5:arm64 (<span style=\"color: #ae81ff;\">1</span>:<span style=\"color: #ae81ff;\">3.6.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">7</span>) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package libsnmp<span style=\"color: #f92672;\">-</span>base<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">19</span><span style=\"color: 
#f92672;\">-</span>libsnmp<span style=\"color: #f92672;\">-</span>base_5<span style=\"color: #ae81ff;\">.9</span><span style=\"color: #f92672;\">+</span>dfsg<span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">4</span><span style=\"color: #f92672;\">+</span>deb11u1_all<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking libsnmp<span style=\"color: #f92672;\">-</span>base (<span style=\"color: #ae81ff;\">5.9</span><span style=\"color: #f92672;\">+</span>dfsg<span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">4</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package libsnmp40:arm64<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">20</span><span style=\"color: #f92672;\">-</span>libsnmp40_5<span style=\"color: #ae81ff;\">.9</span><span style=\"color: #f92672;\">+</span>dfsg<span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">4</span><span style=\"color: #f92672;\">+</span>deb11u1_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking libsnmp40:arm64 (<span style=\"color: #ae81ff;\">5.9</span><span style=\"color: #f92672;\">+</span>dfsg<span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">4</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">21</span><span style=\"color: #f92672;\">-</span>syslog<span style=\"color: #f92672;\">-</span>ng_3<span style=\"color: #ae81ff;\">.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: 
#f92672;\">+</span>deb11u1_all<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking syslog<span style=\"color: #f92672;\">-</span>ng (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>add<span style=\"color: #f92672;\">-</span>contextual<span style=\"color: #f92672;\">-</span>data<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">22</span><span style=\"color: #f92672;\">-</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>add<span style=\"color: #f92672;\">-</span>contextual<span style=\"color: #f92672;\">-</span>data_3<span style=\"color: #ae81ff;\">.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>add<span style=\"color: #f92672;\">-</span>contextual<span style=\"color: #f92672;\">-</span>data (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>amqp<span style=\"color: 
#f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">23</span><span style=\"color: #f92672;\">-</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>amqp_3<span style=\"color: #ae81ff;\">.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>amqp (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>examples<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">24</span><span style=\"color: #f92672;\">-</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>examples_3<span style=\"color: #ae81ff;\">.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>examples (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: 
#f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>extra<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">25</span><span style=\"color: #f92672;\">-</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>extra_3<span style=\"color: #ae81ff;\">.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1_all<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>extra (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>geoip2<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">26</span><span style=\"color: #f92672;\">-</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>geoip2_3<span style=\"color: #ae81ff;\">.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking syslog<span style=\"color: 
#f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>geoip2 (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>graphite<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">27</span><span style=\"color: #f92672;\">-</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>graphite_3<span style=\"color: #ae81ff;\">.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>graphite (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>http<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">28</span><span style=\"color: #f92672;\">-</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>http_3<span style=\"color: 
#ae81ff;\">.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>http (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>python<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">29</span><span style=\"color: #f92672;\">-</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>python_3<span style=\"color: #ae81ff;\">.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>python (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>rdkafka<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: 
#f92672;\">.../</span><span style=\"color: #ae81ff;\">30</span><span style=\"color: #f92672;\">-</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>rdkafka_3<span style=\"color: #ae81ff;\">.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>rdkafka (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>redis<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">31</span><span style=\"color: #f92672;\">-</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>redis_3<span style=\"color: #ae81ff;\">.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>redis (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: 
#f92672;\">...</span>Selecting previously unselected package syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>riemann<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">32</span><span style=\"color: #f92672;\">-</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>riemann_3<span style=\"color: #ae81ff;\">.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>riemann (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>slog<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">33</span><span style=\"color: #f92672;\">-</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>slog_3<span style=\"color: #ae81ff;\">.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: 
#f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>slog (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>smtp<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">34</span><span style=\"color: #f92672;\">-</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>smtp_3<span style=\"color: #ae81ff;\">.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>smtp (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>snmp<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">35</span><span style=\"color: #f92672;\">-</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>snmp_3<span style=\"color: #ae81ff;\">.28.1</span><span style=\"color: #f92672;\">-</span><span 
style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>snmp (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>stomp<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">36</span><span style=\"color: #f92672;\">-</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>stomp_3<span style=\"color: #ae81ff;\">.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>stomp (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>xml<span style=\"color: #f92672;\">-</span>parser<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span 
style=\"color: #ae81ff;\">37</span><span style=\"color: #f92672;\">-</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>xml<span style=\"color: #f92672;\">-</span>parser_3<span style=\"color: #ae81ff;\">.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>xml<span style=\"color: #f92672;\">-</span>parser (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>getent<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">38</span><span style=\"color: #f92672;\">-</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>getent_3<span style=\"color: #ae81ff;\">.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>getent (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: 
#f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>map<span style=\"color: #f92672;\">-</span>value<span style=\"color: #f92672;\">-</span>pairs<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">39</span><span style=\"color: #f92672;\">-</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>map<span style=\"color: #f92672;\">-</span>value<span style=\"color: #f92672;\">-</span>pairs_3<span style=\"color: #ae81ff;\">.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>map<span style=\"color: #f92672;\">-</span>value<span style=\"color: #f92672;\">-</span>pairs (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Selecting previously unselected package syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>stardate<span style=\"color: #f92672;\">.</span>Preparing to unpack <span style=\"color: #f92672;\">.../</span><span style=\"color: #ae81ff;\">40</span><span style=\"color: #f92672;\">-</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>stardate_3<span style=\"color: 
#ae81ff;\">.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1_arm64<span style=\"color: #f92672;\">.</span>deb <span style=\"color: #f92672;\">...</span>Unpacking syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>stardate (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Setting up librabbitmq4:arm64 (<span style=\"color: #ae81ff;\">0.10.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span>) <span style=\"color: #f92672;\">...</span>Setting up libdbi1:arm64 (<span style=\"color: #ae81ff;\">0.9.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">6</span>) <span style=\"color: #f92672;\">...</span>Setting up libsnmp<span style=\"color: #f92672;\">-</span>base (<span style=\"color: #ae81ff;\">5.9</span><span style=\"color: #f92672;\">+</span>dfsg<span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">4</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Setting up libmaxminddb0:arm64 (<span style=\"color: #ae81ff;\">1.5.2</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span>) <span style=\"color: #f92672;\">...</span>Setting up libsensors<span style=\"color: #f92672;\">-</span>config (<span style=\"color: #ae81ff;\">1</span>:<span style=\"color: #ae81ff;\">3.6.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">7</span>) <span style=\"color: #f92672;\">...</span>Setting up libesmtp6 (<span style=\"color: #ae81ff;\">1.0.6</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">4.3</span>) <span style=\"color: 
#f92672;\">...</span>Setting up libnet1:arm64 (<span style=\"color: #ae81ff;\">1.1.6</span><span style=\"color: #f92672;\">+</span>dfsg<span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">3.1</span>) <span style=\"color: #f92672;\">...</span>Setting up libprotobuf<span style=\"color: #f92672;\">-</span>c1:arm64 (<span style=\"color: #ae81ff;\">1.3.3</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span><span style=\"color: #f92672;\">+</span>b2) <span style=\"color: #f92672;\">...</span>Setting up libsnappy1v5:arm64 (<span style=\"color: #ae81ff;\">1.1.8</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span>) <span style=\"color: #f92672;\">...</span>Setting up libbson<span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">0</span> (<span style=\"color: #ae81ff;\">1.17.6</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span>) <span style=\"color: #f92672;\">...</span>Setting up libivykis0:arm64 (<span style=\"color: #ae81ff;\">0.42.4</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span>) <span style=\"color: #f92672;\">...</span>Setting up libriemann<span style=\"color: #f92672;\">-</span>client0:arm64 (<span style=\"color: #ae81ff;\">1.10.4</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>b2) <span style=\"color: #f92672;\">...</span>Setting up libsensors5:arm64 (<span style=\"color: #ae81ff;\">1</span>:<span style=\"color: #ae81ff;\">3.6.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">7</span>) <span style=\"color: #f92672;\">...</span>Setting up librdkafka1:arm64 (<span style=\"color: #ae81ff;\">1.6.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span>) <span style=\"color: 
#f92672;\">...</span>Setting up libhiredis0<span style=\"color: #ae81ff;\">.14</span>:arm64 (<span style=\"color: #ae81ff;\">0.14.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span>) <span style=\"color: #f92672;\">...</span>Setting up libmongocrypt0:arm64 (<span style=\"color: #ae81ff;\">1.1.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span>) <span style=\"color: #f92672;\">...</span>Setting up libsnmp40:arm64 (<span style=\"color: #ae81ff;\">5.9</span><span style=\"color: #f92672;\">+</span>dfsg<span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">4</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Setting up libmongoc<span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1.0</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">0</span> (<span style=\"color: #ae81ff;\">1.17.6</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">1</span>) <span style=\"color: #f92672;\">...</span>Setting up syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>core (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Created symlink <span style=\"color: #f92672;\">/</span>etc<span style=\"color: #f92672;\">/</span>systemd<span style=\"color: #f92672;\">/</span>system<span style=\"color: #f92672;\">/</span>multi<span style=\"color: #f92672;\">-</span>user<span style=\"color: #f92672;\">.</span>target<span style=\"color: #f92672;\">.</span>wants<span style=\"color: #f92672;\">/</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">.</span>service <span style=\"color: #960050; background-color: #1e0010;\">→</span> <span style=\"color: 
#f92672;\">/</span>lib<span style=\"color: #f92672;\">/</span>systemd<span style=\"color: #f92672;\">/</span>system<span style=\"color: #f92672;\">/</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">.</span>service<span style=\"color: #f92672;\">.</span>Setting up syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>examples (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Setting up syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>xml<span style=\"color: #f92672;\">-</span>parser (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Setting up syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>stomp (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Setting up syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>riemann (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Setting up syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>stardate (<span style=\"color: 
#ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Setting up syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>geoip2 (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Setting up syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>getent (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Setting up syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>amqp (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Setting up syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>python (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Setting up syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>smtp (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: 
#f92672;\">...</span>Setting up syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>snmp (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Setting up syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>extra (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Setting up syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>rdkafka (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Setting up syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>graphite (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Setting up syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>add<span style=\"color: #f92672;\">-</span>contextual<span style=\"color: #f92672;\">-</span>data (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Setting up syslog<span style=\"color: 
#f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>mongodb (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Setting up syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>http (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Setting up syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>slog (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Setting up syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>map<span style=\"color: #f92672;\">-</span>value<span style=\"color: #f92672;\">-</span>pairs (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Setting up syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: #f92672;\">-</span>sql (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Setting up syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">-</span>mod<span style=\"color: 
#f92672;\">-</span>redis (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Setting up syslog<span style=\"color: #f92672;\">-</span>ng (<span style=\"color: #ae81ff;\">3.28.1</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span><span style=\"color: #f92672;\">+</span>deb11u1) <span style=\"color: #f92672;\">...</span>Processing triggers <span style=\"color: #66d9ef;\">for</span> man<span style=\"color: #f92672;\">-</span>db (<span style=\"color: #ae81ff;\">2.9.4</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">2</span>) <span style=\"color: #f92672;\">...</span>Processing triggers <span style=\"color: #66d9ef;\">for</span> libc<span style=\"color: #f92672;\">-</span>bin (<span style=\"color: #ae81ff;\">2.31</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">13</span><span style=\"color: #f92672;\">+</span>rpt2<span style=\"color: #f92672;\">+</span>rpi1<span style=\"color: #f92672;\">+</span>deb11u8) <span style=\"color: #f92672;\">...</span>Stderr: debconf: unable to initialize frontend: Dialogdebconf: (TERM <span style=\"color: #f92672;\">is</span> <span style=\"color: #f92672;\">not</span> set, so the dialog frontend <span style=\"color: #f92672;\">is</span> <span style=\"color: #f92672;\">not</span> usable<span style=\"color: #f92672;\">.</span>)debconf: falling back to frontend: Readlinedebconf: unable to initialize frontend: Readlinedebconf: (This frontend requires a controlling tty<span style=\"color: #f92672;\">.</span>)debconf: falling back to frontend: Teletypedpkg<span style=\"color: #f92672;\">-</span>preconfigure: unable to re<span style=\"color: #f92672;\">-</span>open stdin: <span style=\"color: #f92672;\">....</span><span style=\"color: #f92672;\">....</span></code></pre></div></details><br /><!-- 
raw HTML omitted -->7. Following the installation of <em>syslog-ng</em> across Nodes 2-7, we verify that the installation was successful by checking the <em>syslog-ng</em> service status.</p><p><details>  <strong>Output of <em>parallel-ssh -h /opt/workers -i &ldquo;systemctl status syslog-ng&rdquo;</em>. Click to expand</strong>  <div class=\"highlight\"><pre><code class=\"language-python\">root<span style=\"color: #a6e22e;\">@turingpi</span>:<span style=\"color: #f92672;\">~</span><span style=\"color: #75715e;\"># parallel-ssh -h /opt/workers -i \"systemctl status syslog-ng\" </span>[<span style=\"color: #ae81ff;\">1</span>] <span style=\"color: #ae81ff;\">14</span>:<span style=\"color: #ae81ff;\">03</span>:<span style=\"color: #ae81ff;\">46</span> [SUCCESS] kemeny<span style=\"color: #960050; background-color: #1e0010;\">●</span> syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">.</span>service <span style=\"color: #f92672;\">-</span> System Logger Daemon     Loaded: loaded (<span style=\"color: #f92672;\">/</span>lib<span style=\"color: #f92672;\">/</span>systemd<span style=\"color: #f92672;\">/</span>system<span style=\"color: #f92672;\">/</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">.</span>service; enabled; vendor preset: enabled)     Active: active (running) since Thu <span style=\"color: #ae81ff;\">2024</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">03</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">28</span> <span style=\"color: #ae81ff;\">13</span>:<span style=\"color: #ae81ff;\">57</span>:<span style=\"color: #ae81ff;\">01</span> EDT; <span style=\"color: #ae81ff;\">6</span>min ago       Docs: man:syslog<span style=\"color: #f92672;\">-</span>ng(<span style=\"color: #ae81ff;\">8</span>)   Main PID: <span style=\"color: #ae81ff;\">28694</span> (syslog<span style=\"color: #f92672;\">-</span>ng)      Tasks: <span style=\"color: 
#ae81ff;\">2</span> (limit: <span style=\"color: #ae81ff;\">779</span>)        CPU: <span style=\"color: #ae81ff;\">40.228</span>s     CGroup: <span style=\"color: #f92672;\">/</span>system<span style=\"color: #f92672;\">.</span>slice<span style=\"color: #f92672;\">/</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">.</span>service             <span style=\"color: #960050; background-color: #1e0010;\">└─</span><span style=\"color: #ae81ff;\">28694</span> <span style=\"color: #f92672;\">/</span>usr<span style=\"color: #f92672;\">/</span>sbin<span style=\"color: #f92672;\">/</span>syslog<span style=\"color: #f92672;\">-</span>ng <span style=\"color: #f92672;\">-</span>FMar <span style=\"color: #ae81ff;\">28</span> <span style=\"color: #ae81ff;\">13</span>:<span style=\"color: #ae81ff;\">57</span>:<span style=\"color: #ae81ff;\">00</span> kemeny systemd[<span style=\"color: #ae81ff;\">1</span>]: Starting System Logger Daemon<span style=\"color: #f92672;\">...</span>Mar <span style=\"color: #ae81ff;\">28</span> <span style=\"color: #ae81ff;\">13</span>:<span style=\"color: #ae81ff;\">57</span>:<span style=\"color: #ae81ff;\">01</span> kemeny syslog<span style=\"color: #f92672;\">-</span>ng[<span style=\"color: #ae81ff;\">28694</span>]: DIGEST<span style=\"color: #f92672;\">-</span>MD5 common mech freeMar <span style=\"color: #ae81ff;\">28</span> <span style=\"color: #ae81ff;\">13</span>:<span style=\"color: #ae81ff;\">57</span>:<span style=\"color: #ae81ff;\">01</span> kemeny systemd[<span style=\"color: #ae81ff;\">1</span>]: Started System Logger Daemon<span style=\"color: #f92672;\">.</span>[<span style=\"color: #ae81ff;\">2</span>] <span style=\"color: #ae81ff;\">14</span>:<span style=\"color: #ae81ff;\">03</span>:<span style=\"color: #ae81ff;\">50</span> [SUCCESS] vonkarman<span style=\"color: #960050; background-color: #1e0010;\">●</span> syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: 
#f92672;\">.</span>service <span style=\"color: #f92672;\">-</span> System Logger Daemon     Loaded: loaded (<span style=\"color: #f92672;\">/</span>lib<span style=\"color: #f92672;\">/</span>systemd<span style=\"color: #f92672;\">/</span>system<span style=\"color: #f92672;\">/</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">.</span>service; enabled; vendor preset: enabled)     Active: active (running) since Thu <span style=\"color: #ae81ff;\">2024</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">03</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">28</span> <span style=\"color: #ae81ff;\">13</span>:<span style=\"color: #ae81ff;\">57</span>:<span style=\"color: #ae81ff;\">49</span> EDT; <span style=\"color: #ae81ff;\">5</span>min ago       Docs: man:syslog<span style=\"color: #f92672;\">-</span>ng(<span style=\"color: #ae81ff;\">8</span>)   Main PID: <span style=\"color: #ae81ff;\">27486</span> (syslog<span style=\"color: #f92672;\">-</span>ng)      Tasks: <span style=\"color: #ae81ff;\">2</span> (limit: <span style=\"color: #ae81ff;\">779</span>)        CPU: <span style=\"color: #ae81ff;\">2</span>min <span style=\"color: #ae81ff;\">5.540</span>s     CGroup: <span style=\"color: #f92672;\">/</span>system<span style=\"color: #f92672;\">.</span>slice<span style=\"color: #f92672;\">/</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">.</span>service             <span style=\"color: #960050; background-color: #1e0010;\">└─</span><span style=\"color: #ae81ff;\">27486</span> <span style=\"color: #f92672;\">/</span>usr<span style=\"color: #f92672;\">/</span>sbin<span style=\"color: #f92672;\">/</span>syslog<span style=\"color: #f92672;\">-</span>ng <span style=\"color: #f92672;\">-</span>FMar <span style=\"color: #ae81ff;\">28</span> <span style=\"color: #ae81ff;\">13</span>:<span style=\"color: #ae81ff;\">57</span>:<span style=\"color: 
#ae81ff;\">44</span> vonkarman systemd[<span style=\"color: #ae81ff;\">1</span>]: Starting System Logger Daemon<span style=\"color: #f92672;\">...</span>Mar <span style=\"color: #ae81ff;\">28</span> <span style=\"color: #ae81ff;\">13</span>:<span style=\"color: #ae81ff;\">57</span>:<span style=\"color: #ae81ff;\">46</span> vonkarman syslog<span style=\"color: #f92672;\">-</span>ng[<span style=\"color: #ae81ff;\">27486</span>]: DIGEST<span style=\"color: #f92672;\">-</span>MD5 common mech freeMar <span style=\"color: #ae81ff;\">28</span> <span style=\"color: #ae81ff;\">13</span>:<span style=\"color: #ae81ff;\">57</span>:<span style=\"color: #ae81ff;\">49</span> vonkarman systemd[<span style=\"color: #ae81ff;\">1</span>]: Started System Logger Daemon<span style=\"color: #f92672;\">.</span>[<span style=\"color: #ae81ff;\">3</span>] <span style=\"color: #ae81ff;\">14</span>:<span style=\"color: #ae81ff;\">03</span>:<span style=\"color: #ae81ff;\">51</span> [SUCCESS] teller<span style=\"color: #960050; background-color: #1e0010;\">●</span> syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">.</span>service <span style=\"color: #f92672;\">-</span> System Logger Daemon     Loaded: loaded (<span style=\"color: #f92672;\">/</span>lib<span style=\"color: #f92672;\">/</span>systemd<span style=\"color: #f92672;\">/</span>system<span style=\"color: #f92672;\">/</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">.</span>service; enabled; vendor preset: enabled)     Active: active (running) since Thu <span style=\"color: #ae81ff;\">2024</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">03</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">28</span> <span style=\"color: #ae81ff;\">13</span>:<span style=\"color: #ae81ff;\">57</span>:<span style=\"color: #ae81ff;\">39</span> EDT; <span style=\"color: #ae81ff;\">6</span>min ago       Docs: man:syslog<span 
style=\"color: #f92672;\">-</span>ng(<span style=\"color: #ae81ff;\">8</span>)   Main PID: <span style=\"color: #ae81ff;\">24821</span> (syslog<span style=\"color: #f92672;\">-</span>ng)      Tasks: <span style=\"color: #ae81ff;\">2</span> (limit: <span style=\"color: #ae81ff;\">779</span>)        CPU: <span style=\"color: #ae81ff;\">2</span>min <span style=\"color: #ae81ff;\">262</span>ms     CGroup: <span style=\"color: #f92672;\">/</span>system<span style=\"color: #f92672;\">.</span>slice<span style=\"color: #f92672;\">/</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">.</span>service             <span style=\"color: #960050; background-color: #1e0010;\">└─</span><span style=\"color: #ae81ff;\">24821</span> <span style=\"color: #f92672;\">/</span>usr<span style=\"color: #f92672;\">/</span>sbin<span style=\"color: #f92672;\">/</span>syslog<span style=\"color: #f92672;\">-</span>ng <span style=\"color: #f92672;\">-</span>FMar <span style=\"color: #ae81ff;\">28</span> <span style=\"color: #ae81ff;\">13</span>:<span style=\"color: #ae81ff;\">57</span>:<span style=\"color: #ae81ff;\">38</span> teller systemd[<span style=\"color: #ae81ff;\">1</span>]: Starting System Logger Daemon<span style=\"color: #f92672;\">...</span>Mar <span style=\"color: #ae81ff;\">28</span> <span style=\"color: #ae81ff;\">13</span>:<span style=\"color: #ae81ff;\">57</span>:<span style=\"color: #ae81ff;\">38</span> teller syslog<span style=\"color: #f92672;\">-</span>ng[<span style=\"color: #ae81ff;\">24821</span>]: DIGEST<span style=\"color: #f92672;\">-</span>MD5 common mech freeMar <span style=\"color: #ae81ff;\">28</span> <span style=\"color: #ae81ff;\">13</span>:<span style=\"color: #ae81ff;\">57</span>:<span style=\"color: #ae81ff;\">39</span> teller systemd[<span style=\"color: #ae81ff;\">1</span>]: Started System Logger Daemon<span style=\"color: #f92672;\">.</span>[<span style=\"color: #ae81ff;\">4</span>] <span style=\"color: 
#ae81ff;\">14</span>:<span style=\"color: #ae81ff;\">03</span>:<span style=\"color: #ae81ff;\">53</span> [SUCCESS] neumann<span style=\"color: #960050; background-color: #1e0010;\">●</span> syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">.</span>service <span style=\"color: #f92672;\">-</span> System Logger Daemon     Loaded: loaded (<span style=\"color: #f92672;\">/</span>lib<span style=\"color: #f92672;\">/</span>systemd<span style=\"color: #f92672;\">/</span>system<span style=\"color: #f92672;\">/</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">.</span>service; enabled; vendor preset: enabled)     Active: active (running) since Thu <span style=\"color: #ae81ff;\">2024</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">03</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">28</span> <span style=\"color: #ae81ff;\">13</span>:<span style=\"color: #ae81ff;\">57</span>:<span style=\"color: #ae81ff;\">39</span> EDT; <span style=\"color: #ae81ff;\">6</span>min ago       Docs: man:syslog<span style=\"color: #f92672;\">-</span>ng(<span style=\"color: #ae81ff;\">8</span>)   Main PID: <span style=\"color: #ae81ff;\">27734</span> (syslog<span style=\"color: #f92672;\">-</span>ng)      Tasks: <span style=\"color: #ae81ff;\">2</span> (limit: <span style=\"color: #ae81ff;\">779</span>)        CPU: <span style=\"color: #ae81ff;\">1</span>min <span style=\"color: #ae81ff;\">43.504</span>s     CGroup: <span style=\"color: #f92672;\">/</span>system<span style=\"color: #f92672;\">.</span>slice<span style=\"color: #f92672;\">/</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">.</span>service             <span style=\"color: #960050; background-color: #1e0010;\">└─</span><span style=\"color: #ae81ff;\">27734</span> <span style=\"color: #f92672;\">/</span>usr<span style=\"color: #f92672;\">/</span>sbin<span style=\"color: 
#f92672;\">/</span>syslog<span style=\"color: #f92672;\">-</span>ng <span style=\"color: #f92672;\">-</span>FMar <span style=\"color: #ae81ff;\">28</span> <span style=\"color: #ae81ff;\">13</span>:<span style=\"color: #ae81ff;\">57</span>:<span style=\"color: #ae81ff;\">38</span> neumann systemd[<span style=\"color: #ae81ff;\">1</span>]: Starting System Logger Daemon<span style=\"color: #f92672;\">...</span>Mar <span style=\"color: #ae81ff;\">28</span> <span style=\"color: #ae81ff;\">13</span>:<span style=\"color: #ae81ff;\">57</span>:<span style=\"color: #ae81ff;\">38</span> neumann syslog<span style=\"color: #f92672;\">-</span>ng[<span style=\"color: #ae81ff;\">27734</span>]: DIGEST<span style=\"color: #f92672;\">-</span>MD5 common mech freeMar <span style=\"color: #ae81ff;\">28</span> <span style=\"color: #ae81ff;\">13</span>:<span style=\"color: #ae81ff;\">57</span>:<span style=\"color: #ae81ff;\">39</span> neumann systemd[<span style=\"color: #ae81ff;\">1</span>]: Started System Logger Daemon<span style=\"color: #f92672;\">.</span>[<span style=\"color: #ae81ff;\">5</span>] <span style=\"color: #ae81ff;\">14</span>:<span style=\"color: #ae81ff;\">03</span>:<span style=\"color: #ae81ff;\">53</span> [SUCCESS] wigner<span style=\"color: #960050; background-color: #1e0010;\">●</span> syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">.</span>service <span style=\"color: #f92672;\">-</span> System Logger Daemon     Loaded: loaded (<span style=\"color: #f92672;\">/</span>lib<span style=\"color: #f92672;\">/</span>systemd<span style=\"color: #f92672;\">/</span>system<span style=\"color: #f92672;\">/</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">.</span>service; enabled; vendor preset: enabled)     Active: active (running) since Thu <span style=\"color: #ae81ff;\">2024</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">03</span><span style=\"color: #f92672;\">-</span><span 
style=\"color: #ae81ff;\">28</span> <span style=\"color: #ae81ff;\">13</span>:<span style=\"color: #ae81ff;\">57</span>:<span style=\"color: #ae81ff;\">37</span> EDT; <span style=\"color: #ae81ff;\">6</span>min ago       Docs: man:syslog<span style=\"color: #f92672;\">-</span>ng(<span style=\"color: #ae81ff;\">8</span>)   Main PID: <span style=\"color: #ae81ff;\">27512</span> (syslog<span style=\"color: #f92672;\">-</span>ng)      Tasks: <span style=\"color: #ae81ff;\">2</span> (limit: <span style=\"color: #ae81ff;\">779</span>)        CPU: <span style=\"color: #ae81ff;\">1</span>min <span style=\"color: #ae81ff;\">49.643</span>s     CGroup: <span style=\"color: #f92672;\">/</span>system<span style=\"color: #f92672;\">.</span>slice<span style=\"color: #f92672;\">/</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">.</span>service             <span style=\"color: #960050; background-color: #1e0010;\">└─</span><span style=\"color: #ae81ff;\">27512</span> <span style=\"color: #f92672;\">/</span>usr<span style=\"color: #f92672;\">/</span>sbin<span style=\"color: #f92672;\">/</span>syslog<span style=\"color: #f92672;\">-</span>ng <span style=\"color: #f92672;\">-</span>FMar <span style=\"color: #ae81ff;\">28</span> <span style=\"color: #ae81ff;\">13</span>:<span style=\"color: #ae81ff;\">57</span>:<span style=\"color: #ae81ff;\">36</span> wigner systemd[<span style=\"color: #ae81ff;\">1</span>]: Starting System Logger Daemon<span style=\"color: #f92672;\">...</span>Mar <span style=\"color: #ae81ff;\">28</span> <span style=\"color: #ae81ff;\">13</span>:<span style=\"color: #ae81ff;\">57</span>:<span style=\"color: #ae81ff;\">36</span> wigner syslog<span style=\"color: #f92672;\">-</span>ng[<span style=\"color: #ae81ff;\">27512</span>]: DIGEST<span style=\"color: #f92672;\">-</span>MD5 common mech freeMar <span style=\"color: #ae81ff;\">28</span> <span style=\"color: #ae81ff;\">13</span>:<span style=\"color: #ae81ff;\">57</span>:<span 
style=\"color: #ae81ff;\">37</span> wigner systemd[<span style=\"color: #ae81ff;\">1</span>]: Started System Logger Daemon<span style=\"color: #f92672;\">.</span>[<span style=\"color: #ae81ff;\">6</span>] <span style=\"color: #ae81ff;\">14</span>:<span style=\"color: #ae81ff;\">03</span>:<span style=\"color: #ae81ff;\">57</span> [SUCCESS] szilard<span style=\"color: #960050; background-color: #1e0010;\">●</span> syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">.</span>service <span style=\"color: #f92672;\">-</span> System Logger Daemon     Loaded: loaded (<span style=\"color: #f92672;\">/</span>lib<span style=\"color: #f92672;\">/</span>systemd<span style=\"color: #f92672;\">/</span>system<span style=\"color: #f92672;\">/</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">.</span>service; enabled; vendor preset: enabled)     Active: active (running) since Thu <span style=\"color: #ae81ff;\">2024</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">03</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">28</span> <span style=\"color: #ae81ff;\">13</span>:<span style=\"color: #ae81ff;\">57</span>:<span style=\"color: #ae81ff;\">35</span> EDT; <span style=\"color: #ae81ff;\">6</span>min ago       Docs: man:syslog<span style=\"color: #f92672;\">-</span>ng(<span style=\"color: #ae81ff;\">8</span>)   Main PID: <span style=\"color: #ae81ff;\">24136</span> (syslog<span style=\"color: #f92672;\">-</span>ng)      Tasks: <span style=\"color: #ae81ff;\">5</span> (limit: <span style=\"color: #ae81ff;\">779</span>)        CPU: <span style=\"color: #ae81ff;\">2</span>min <span style=\"color: #ae81ff;\">10.257</span>s     CGroup: <span style=\"color: #f92672;\">/</span>system<span style=\"color: #f92672;\">.</span>slice<span style=\"color: #f92672;\">/</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">.</span>service             
<span style=\"color: #960050; background-color: #1e0010;\">└─</span><span style=\"color: #ae81ff;\">24136</span> <span style=\"color: #f92672;\">/</span>usr<span style=\"color: #f92672;\">/</span>sbin<span style=\"color: #f92672;\">/</span>syslog<span style=\"color: #f92672;\">-</span>ng <span style=\"color: #f92672;\">-</span>FMar <span style=\"color: #ae81ff;\">28</span> <span style=\"color: #ae81ff;\">13</span>:<span style=\"color: #ae81ff;\">57</span>:<span style=\"color: #ae81ff;\">34</span> szilard systemd[<span style=\"color: #ae81ff;\">1</span>]: Starting System Logger Daemon<span style=\"color: #f92672;\">...</span>Mar <span style=\"color: #ae81ff;\">28</span> <span style=\"color: #ae81ff;\">13</span>:<span style=\"color: #ae81ff;\">57</span>:<span style=\"color: #ae81ff;\">34</span> szilard syslog<span style=\"color: #f92672;\">-</span>ng[<span style=\"color: #ae81ff;\">24136</span>]: DIGEST<span style=\"color: #f92672;\">-</span>MD5 common mech freeMar <span style=\"color: #ae81ff;\">28</span> <span style=\"color: #ae81ff;\">13</span>:<span style=\"color: #ae81ff;\">57</span>:<span style=\"color: #ae81ff;\">35</span> szilard systemd[<span style=\"color: #ae81ff;\">1</span>]: Started System Logger Daemon<span style=\"color: #f92672;\">.</span></code></pre></div></details><br /><!-- raw HTML omitted -->8. Create the configuration file <em>send.conf</em> in <em>/opt</em> on host <em>turingpi</em>. Note that <em>/opt</em> is an NFS export on <em>turingpi</em> and is NFS-mounted by all of the compute nodes. This file sets the HOST field of outgoing log messages to the local hostname. This is done in the subsequent steps, where the <em>“placeholder”</em> string is replaced with the local hostname using <em>sed</em>. 
Additionally, a data source <em>s_hpc</em> is defined which will scan <em>/opt/ibm/lsf/log</em> for the presence of LSF daemon logfiles.</p><div class=\"highlight\"><pre><code class=\"language-plaintext\"> root@turingpi:/# cat /opt/send.confrewrite r_host { set(\"placeholder\", value(\"HOST\")); };destination d_net {  syslog(\"turingpi\" port(601));};source s_hpc {  wildcard-file(      base-dir(\"/opt/ibm/lsf/log\")      filename-pattern(\"*.log.*\")      recursive(no)      follow-freq(1)  );};log {  source(s_src);  source(s_hpc);  rewrite(r_host);   destination(d_net);};</code></pre></div><ol start=\"9\"><li>On Nodes 2-7, copy the file <em>/opt/send.conf</em> to <em>/etc/syslog-ng/conf.d/send.conf</em>.</li></ol><div class=\"highlight\"><pre><code class=\"language-plaintext\"> root@turingpi:/# parallel-ssh -h /opt/workers -i \"cp /opt/send.conf /etc/syslog-ng/conf.d\" [1] 14:19:29 [SUCCESS] kemeny[2] 14:19:30 [SUCCESS] vonkarman[3] 14:19:30 [SUCCESS] wigner[4] 14:19:30 [SUCCESS] szilard[5] 14:19:30 [SUCCESS] teller[6] 14:19:31 [SUCCESS] neumann</code></pre></div><ol start=\"10\"><li>Using <em>sed</em>, replace the <em>“placeholder”</em> string in <em>/etc/syslog-ng/conf.d/send.conf</em> with the local hostname. We also double-check that the change was made correctly.</li></ol><div class=\"highlight\"><pre><code class=\"language-plaintext\"> root@turingpi:/# parallel-ssh -h /opt/workers -i 'HOST=`hostname`; sed -i \"s/placeholder/$HOST/g\" /etc/syslog-ng/conf.d/send.conf' [1] 14:38:09 [SUCCESS] kemeny[2] 14:38:09 [SUCCESS] teller[3] 14:38:09 [SUCCESS] vonkarman[4] 14:38:09 [SUCCESS] wigner[5] 14:38:09 [SUCCESS] neumann[6] 14:38:09 [SUCCESS] szilard</code></pre></div><p><details>  <strong>Output of <em>parallel-ssh -h /opt/workers -i &ldquo;cat /etc/syslog-ng/conf.d/send.conf&rdquo;</em>. 
Click to expand</strong>  <div class=\"highlight\"><pre><code class=\"language-python\">root<span style=\"color: #a6e22e;\">@turingpi</span>:<span style=\"color: #f92672;\">/</span><span style=\"color: #75715e;\"># parallel-ssh -h /opt/workers -i \"cat /etc/syslog-ng/conf.d/send.conf\" [1] 14:38:33 [SUCCESS] kemeny</span>rewrite r_host { set(<span style=\"color: #e6db74;\">\"kemeny\"</span>, value(<span style=\"color: #e6db74;\">\"HOST\"</span>)); };destination d_net {  syslog(<span style=\"color: #e6db74;\">\"turingpi\"</span> port(<span style=\"color: #ae81ff;\">601</span>));};source s_hpc {  wildcard<span style=\"color: #f92672;\">-</span>file(      base<span style=\"color: #f92672;\">-</span>dir(<span style=\"color: #e6db74;\">\"/opt/ibm/lsf/log\"</span>)      filename<span style=\"color: #f92672;\">-</span>pattern(<span style=\"color: #e6db74;\">\"*.log.*\"</span>)      recursive(no)      follow<span style=\"color: #f92672;\">-</span>freq(<span style=\"color: #ae81ff;\">1</span>)  );};log {  source(s_sys);  source(s_hpc);  rewrite(r_host);   destination(d_net);};[<span style=\"color: #ae81ff;\">2</span>] <span style=\"color: #ae81ff;\">14</span>:<span style=\"color: #ae81ff;\">38</span>:<span style=\"color: #ae81ff;\">33</span> [SUCCESS] tellerrewrite r_host { set(<span style=\"color: #e6db74;\">\"teller\"</span>, value(<span style=\"color: #e6db74;\">\"HOST\"</span>)); };destination d_net {  syslog(<span style=\"color: #e6db74;\">\"turingpi\"</span> port(<span style=\"color: #ae81ff;\">601</span>));};source s_hpc {  wildcard<span style=\"color: #f92672;\">-</span>file(      base<span style=\"color: #f92672;\">-</span>dir(<span style=\"color: #e6db74;\">\"/opt/ibm/lsf/log\"</span>)      filename<span style=\"color: #f92672;\">-</span>pattern(<span style=\"color: #e6db74;\">\"*.log.*\"</span>)      recursive(no)      follow<span style=\"color: #f92672;\">-</span>freq(<span style=\"color: #ae81ff;\">1</span>)  );};log {  source(s_sys);  source(s_hpc);  
rewrite(r_host);   destination(d_net);};[<span style=\"color: #ae81ff;\">3</span>] <span style=\"color: #ae81ff;\">14</span>:<span style=\"color: #ae81ff;\">38</span>:<span style=\"color: #ae81ff;\">33</span> [SUCCESS] neumannrewrite r_host { set(<span style=\"color: #e6db74;\">\"neumann\"</span>, value(<span style=\"color: #e6db74;\">\"HOST\"</span>)); };destination d_net {  syslog(<span style=\"color: #e6db74;\">\"turingpi\"</span> port(<span style=\"color: #ae81ff;\">601</span>));};source s_hpc {  wildcard<span style=\"color: #f92672;\">-</span>file(      base<span style=\"color: #f92672;\">-</span>dir(<span style=\"color: #e6db74;\">\"/opt/ibm/lsf/log\"</span>)      filename<span style=\"color: #f92672;\">-</span>pattern(<span style=\"color: #e6db74;\">\"*.log.*\"</span>)      recursive(no)      follow<span style=\"color: #f92672;\">-</span>freq(<span style=\"color: #ae81ff;\">1</span>)  );};log {  source(s_sys);  source(s_hpc);  rewrite(r_host);   destination(d_net);};[<span style=\"color: #ae81ff;\">4</span>] <span style=\"color: #ae81ff;\">14</span>:<span style=\"color: #ae81ff;\">38</span>:<span style=\"color: #ae81ff;\">33</span> [SUCCESS] szilardrewrite r_host { set(<span style=\"color: #e6db74;\">\"szilard\"</span>, value(<span style=\"color: #e6db74;\">\"HOST\"</span>)); };destination d_net {  syslog(<span style=\"color: #e6db74;\">\"turingpi\"</span> port(<span style=\"color: #ae81ff;\">601</span>));};source s_hpc {  wildcard<span style=\"color: #f92672;\">-</span>file(      base<span style=\"color: #f92672;\">-</span>dir(<span style=\"color: #e6db74;\">\"/opt/ibm/lsf/log\"</span>)      filename<span style=\"color: #f92672;\">-</span>pattern(<span style=\"color: #e6db74;\">\"*.log.*\"</span>)      recursive(no)      follow<span style=\"color: #f92672;\">-</span>freq(<span style=\"color: #ae81ff;\">1</span>)  );};log {  source(s_sys);  source(s_hpc);  rewrite(r_host);   destination(d_net);};[<span style=\"color: #ae81ff;\">5</span>] <span style=\"color: 
#ae81ff;\">14</span>:<span style=\"color: #ae81ff;\">38</span>:<span style=\"color: #ae81ff;\">33</span> [SUCCESS] wignerrewrite r_host { set(<span style=\"color: #e6db74;\">\"wigner\"</span>, value(<span style=\"color: #e6db74;\">\"HOST\"</span>)); };destination d_net {  syslog(<span style=\"color: #e6db74;\">\"turingpi\"</span> port(<span style=\"color: #ae81ff;\">601</span>));};source s_hpc {  wildcard<span style=\"color: #f92672;\">-</span>file(      base<span style=\"color: #f92672;\">-</span>dir(<span style=\"color: #e6db74;\">\"/opt/ibm/lsf/log\"</span>)      filename<span style=\"color: #f92672;\">-</span>pattern(<span style=\"color: #e6db74;\">\"*.log.*\"</span>)      recursive(no)      follow<span style=\"color: #f92672;\">-</span>freq(<span style=\"color: #ae81ff;\">1</span>)  );};log {  source(s_sys);  source(s_hpc);  rewrite(r_host);   destination(d_net);};[<span style=\"color: #ae81ff;\">6</span>] <span style=\"color: #ae81ff;\">14</span>:<span style=\"color: #ae81ff;\">38</span>:<span style=\"color: #ae81ff;\">33</span> [SUCCESS] vonkarmanrewrite r_host { set(<span style=\"color: #e6db74;\">\"vonkarman\"</span>, value(<span style=\"color: #e6db74;\">\"HOST\"</span>)); };destination d_net {  syslog(<span style=\"color: #e6db74;\">\"turingpi\"</span> port(<span style=\"color: #ae81ff;\">601</span>));};source s_hpc {  wildcard<span style=\"color: #f92672;\">-</span>file(      base<span style=\"color: #f92672;\">-</span>dir(<span style=\"color: #e6db74;\">\"/opt/ibm/lsf/log\"</span>)      filename<span style=\"color: #f92672;\">-</span>pattern(<span style=\"color: #e6db74;\">\"*.log.*\"</span>)      recursive(no)      follow<span style=\"color: #f92672;\">-</span>freq(<span style=\"color: #ae81ff;\">1</span>)  );};log {  source(s_sys);  source(s_hpc);  rewrite(r_host);   destination(d_net);};</code></pre></div></details><br /><!-- raw HTML omitted -->11. 
Finally, <em>syslog-ng</em> is restarted on Nodes 2-7 and the status of the service is checked to ensure that there are no errors.</p><div class=\"highlight\"><pre><code class=\"language-plaintext\"> root@turingpi:/opt# parallel-ssh -h /opt/workers -i \"systemctl restart syslog-ng\" [1] 14:49:03 [SUCCESS] kemeny[2] 14:49:05 [SUCCESS] szilard[3] 14:49:06 [SUCCESS] vonkarman[4] 14:49:06 [SUCCESS] neumann[5] 14:49:06 [SUCCESS] teller[6] 14:49:07 [SUCCESS] wigner</code></pre></div><p><details>  <strong>Output of <em>parallel-ssh -h /opt/workers -i &ldquo;systemctl status syslog-ng&rdquo;</em>. Click to expand</strong>  <div class=\"highlight\"><pre><code class=\"language-python\">root<span style=\"color: #a6e22e;\">@turingpi</span>:<span style=\"color: #f92672;\">/</span>opt<span style=\"color: #75715e;\"># parallel-ssh -h /opt/workers -i \"systemctl status syslog-ng\" </span>[<span style=\"color: #ae81ff;\">1</span>] <span style=\"color: #ae81ff;\">14</span>:<span style=\"color: #ae81ff;\">49</span>:<span style=\"color: #ae81ff;\">31</span> [SUCCESS] kemeny<span style=\"color: #960050; background-color: #1e0010;\">●</span> syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">.</span>service <span style=\"color: #f92672;\">-</span> System Logger Daemon     Loaded: loaded (<span style=\"color: #f92672;\">/</span>lib<span style=\"color: #f92672;\">/</span>systemd<span style=\"color: #f92672;\">/</span>system<span style=\"color: #f92672;\">/</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">.</span>service; enabled; vendor preset: enabled)     Active: active (running) since Thu <span style=\"color: #ae81ff;\">2024</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">03</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">28</span> <span style=\"color: #ae81ff;\">14</span>:<span style=\"color: #ae81ff;\">49</span>:<span style=\"color: #ae81ff;\">03</span> EDT; 
<span style=\"color: #ae81ff;\">28</span>s ago       Docs: man:syslog<span style=\"color: #f92672;\">-</span>ng(<span style=\"color: #ae81ff;\">8</span>)   Main PID: <span style=\"color: #ae81ff;\">34982</span> (syslog<span style=\"color: #f92672;\">-</span>ng)      Tasks: <span style=\"color: #ae81ff;\">2</span> (limit: <span style=\"color: #ae81ff;\">779</span>)        CPU: <span style=\"color: #ae81ff;\">398</span>ms     CGroup: <span style=\"color: #f92672;\">/</span>system<span style=\"color: #f92672;\">.</span>slice<span style=\"color: #f92672;\">/</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">.</span>service             <span style=\"color: #960050; background-color: #1e0010;\">└─</span><span style=\"color: #ae81ff;\">34982</span> <span style=\"color: #f92672;\">/</span>usr<span style=\"color: #f92672;\">/</span>sbin<span style=\"color: #f92672;\">/</span>syslog<span style=\"color: #f92672;\">-</span>ng <span style=\"color: #f92672;\">-</span>FMar <span style=\"color: #ae81ff;\">28</span> <span style=\"color: #ae81ff;\">14</span>:<span style=\"color: #ae81ff;\">49</span>:<span style=\"color: #ae81ff;\">02</span> kemeny systemd[<span style=\"color: #ae81ff;\">1</span>]: Starting System Logger Daemon<span style=\"color: #f92672;\">...</span>Mar <span style=\"color: #ae81ff;\">28</span> <span style=\"color: #ae81ff;\">14</span>:<span style=\"color: #ae81ff;\">49</span>:<span style=\"color: #ae81ff;\">02</span> kemeny syslog<span style=\"color: #f92672;\">-</span>ng[<span style=\"color: #ae81ff;\">34982</span>]: DIGEST<span style=\"color: #f92672;\">-</span>MD5 common mech freeMar <span style=\"color: #ae81ff;\">28</span> <span style=\"color: #ae81ff;\">14</span>:<span style=\"color: #ae81ff;\">49</span>:<span style=\"color: #ae81ff;\">03</span> kemeny systemd[<span style=\"color: #ae81ff;\">1</span>]: Started System Logger Daemon<span style=\"color: #f92672;\">.</span>[<span style=\"color: #ae81ff;\">2</span>] <span 
style=\"color: #ae81ff;\">14</span>:<span style=\"color: #ae81ff;\">49</span>:<span style=\"color: #ae81ff;\">33</span> [SUCCESS] vonkarman<span style=\"color: #960050; background-color: #1e0010;\">●</span> syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">.</span>service <span style=\"color: #f92672;\">-</span> System Logger Daemon     Loaded: loaded (<span style=\"color: #f92672;\">/</span>lib<span style=\"color: #f92672;\">/</span>systemd<span style=\"color: #f92672;\">/</span>system<span style=\"color: #f92672;\">/</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">.</span>service; enabled; vendor preset: enabled)     Active: active (running) since Thu <span style=\"color: #ae81ff;\">2024</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">03</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">28</span> <span style=\"color: #ae81ff;\">14</span>:<span style=\"color: #ae81ff;\">49</span>:<span style=\"color: #ae81ff;\">06</span> EDT; <span style=\"color: #ae81ff;\">25</span>s ago       Docs: man:syslog<span style=\"color: #f92672;\">-</span>ng(<span style=\"color: #ae81ff;\">8</span>)   Main PID: <span style=\"color: #ae81ff;\">33710</span> (syslog<span style=\"color: #f92672;\">-</span>ng)      Tasks: <span style=\"color: #ae81ff;\">2</span> (limit: <span style=\"color: #ae81ff;\">779</span>)        CPU: <span style=\"color: #ae81ff;\">934</span>ms     CGroup: <span style=\"color: #f92672;\">/</span>system<span style=\"color: #f92672;\">.</span>slice<span style=\"color: #f92672;\">/</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">.</span>service             <span style=\"color: #960050; background-color: #1e0010;\">└─</span><span style=\"color: #ae81ff;\">33710</span> <span style=\"color: #f92672;\">/</span>usr<span style=\"color: #f92672;\">/</span>sbin<span style=\"color: #f92672;\">/</span>syslog<span 
style=\"color: #f92672;\">-</span>ng <span style=\"color: #f92672;\">-</span>FMar <span style=\"color: #ae81ff;\">28</span> <span style=\"color: #ae81ff;\">14</span>:<span style=\"color: #ae81ff;\">49</span>:<span style=\"color: #ae81ff;\">03</span> vonkarman systemd[<span style=\"color: #ae81ff;\">1</span>]: Starting System Logger Daemon<span style=\"color: #f92672;\">...</span>Mar <span style=\"color: #ae81ff;\">28</span> <span style=\"color: #ae81ff;\">14</span>:<span style=\"color: #ae81ff;\">49</span>:<span style=\"color: #ae81ff;\">03</span> vonkarman syslog<span style=\"color: #f92672;\">-</span>ng[<span style=\"color: #ae81ff;\">33710</span>]: DIGEST<span style=\"color: #f92672;\">-</span>MD5 common mech freeMar <span style=\"color: #ae81ff;\">28</span> <span style=\"color: #ae81ff;\">14</span>:<span style=\"color: #ae81ff;\">49</span>:<span style=\"color: #ae81ff;\">06</span> vonkarman systemd[<span style=\"color: #ae81ff;\">1</span>]: Started System Logger Daemon<span style=\"color: #f92672;\">.</span>[<span style=\"color: #ae81ff;\">3</span>] <span style=\"color: #ae81ff;\">14</span>:<span style=\"color: #ae81ff;\">49</span>:<span style=\"color: #ae81ff;\">33</span> [SUCCESS] neumann<span style=\"color: #960050; background-color: #1e0010;\">●</span> syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">.</span>service <span style=\"color: #f92672;\">-</span> System Logger Daemon     Loaded: loaded (<span style=\"color: #f92672;\">/</span>lib<span style=\"color: #f92672;\">/</span>systemd<span style=\"color: #f92672;\">/</span>system<span style=\"color: #f92672;\">/</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">.</span>service; enabled; vendor preset: enabled)     Active: active (running) since Thu <span style=\"color: #ae81ff;\">2024</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">03</span><span style=\"color: #f92672;\">-</span><span style=\"color: 
#ae81ff;\">28</span> <span style=\"color: #ae81ff;\">14</span>:<span style=\"color: #ae81ff;\">49</span>:<span style=\"color: #ae81ff;\">06</span> EDT; <span style=\"color: #ae81ff;\">25</span>s ago       Docs: man:syslog<span style=\"color: #f92672;\">-</span>ng(<span style=\"color: #ae81ff;\">8</span>)   Main PID: <span style=\"color: #ae81ff;\">34000</span> (syslog<span style=\"color: #f92672;\">-</span>ng)      Tasks: <span style=\"color: #ae81ff;\">2</span> (limit: <span style=\"color: #ae81ff;\">779</span>)        CPU: <span style=\"color: #ae81ff;\">959</span>ms     CGroup: <span style=\"color: #f92672;\">/</span>system<span style=\"color: #f92672;\">.</span>slice<span style=\"color: #f92672;\">/</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">.</span>service             <span style=\"color: #960050; background-color: #1e0010;\">└─</span><span style=\"color: #ae81ff;\">34000</span> <span style=\"color: #f92672;\">/</span>usr<span style=\"color: #f92672;\">/</span>sbin<span style=\"color: #f92672;\">/</span>syslog<span style=\"color: #f92672;\">-</span>ng <span style=\"color: #f92672;\">-</span>FMar <span style=\"color: #ae81ff;\">28</span> <span style=\"color: #ae81ff;\">14</span>:<span style=\"color: #ae81ff;\">49</span>:<span style=\"color: #ae81ff;\">03</span> neumann systemd[<span style=\"color: #ae81ff;\">1</span>]: Starting System Logger Daemon<span style=\"color: #f92672;\">...</span>Mar <span style=\"color: #ae81ff;\">28</span> <span style=\"color: #ae81ff;\">14</span>:<span style=\"color: #ae81ff;\">49</span>:<span style=\"color: #ae81ff;\">03</span> neumann syslog<span style=\"color: #f92672;\">-</span>ng[<span style=\"color: #ae81ff;\">34000</span>]: DIGEST<span style=\"color: #f92672;\">-</span>MD5 common mech freeMar <span style=\"color: #ae81ff;\">28</span> <span style=\"color: #ae81ff;\">14</span>:<span style=\"color: #ae81ff;\">49</span>:<span style=\"color: #ae81ff;\">06</span> neumann systemd[<span 
style=\"color: #ae81ff;\">1</span>]: Started System Logger Daemon<span style=\"color: #f92672;\">.</span>[<span style=\"color: #ae81ff;\">4</span>] <span style=\"color: #ae81ff;\">14</span>:<span style=\"color: #ae81ff;\">49</span>:<span style=\"color: #ae81ff;\">33</span> [SUCCESS] wigner<span style=\"color: #960050; background-color: #1e0010;\">●</span> syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">.</span>service <span style=\"color: #f92672;\">-</span> System Logger Daemon     Loaded: loaded (<span style=\"color: #f92672;\">/</span>lib<span style=\"color: #f92672;\">/</span>systemd<span style=\"color: #f92672;\">/</span>system<span style=\"color: #f92672;\">/</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">.</span>service; enabled; vendor preset: enabled)     Active: active (running) since Thu <span style=\"color: #ae81ff;\">2024</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">03</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">28</span> <span style=\"color: #ae81ff;\">14</span>:<span style=\"color: #ae81ff;\">49</span>:<span style=\"color: #ae81ff;\">07</span> EDT; <span style=\"color: #ae81ff;\">25</span>s ago       Docs: man:syslog<span style=\"color: #f92672;\">-</span>ng(<span style=\"color: #ae81ff;\">8</span>)   Main PID: <span style=\"color: #ae81ff;\">33941</span> (syslog<span style=\"color: #f92672;\">-</span>ng)      Tasks: <span style=\"color: #ae81ff;\">2</span> (limit: <span style=\"color: #ae81ff;\">779</span>)        CPU: <span style=\"color: #ae81ff;\">1.115</span>s     CGroup: <span style=\"color: #f92672;\">/</span>system<span style=\"color: #f92672;\">.</span>slice<span style=\"color: #f92672;\">/</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">.</span>service             <span style=\"color: #960050; background-color: #1e0010;\">└─</span><span style=\"color: 
#ae81ff;\">33941</span> <span style=\"color: #f92672;\">/</span>usr<span style=\"color: #f92672;\">/</span>sbin<span style=\"color: #f92672;\">/</span>syslog<span style=\"color: #f92672;\">-</span>ng <span style=\"color: #f92672;\">-</span>FMar <span style=\"color: #ae81ff;\">28</span> <span style=\"color: #ae81ff;\">14</span>:<span style=\"color: #ae81ff;\">49</span>:<span style=\"color: #ae81ff;\">03</span> wigner systemd[<span style=\"color: #ae81ff;\">1</span>]: Starting System Logger Daemon<span style=\"color: #f92672;\">...</span>Mar <span style=\"color: #ae81ff;\">28</span> <span style=\"color: #ae81ff;\">14</span>:<span style=\"color: #ae81ff;\">49</span>:<span style=\"color: #ae81ff;\">04</span> wigner syslog<span style=\"color: #f92672;\">-</span>ng[<span style=\"color: #ae81ff;\">33941</span>]: DIGEST<span style=\"color: #f92672;\">-</span>MD5 common mech freeMar <span style=\"color: #ae81ff;\">28</span> <span style=\"color: #ae81ff;\">14</span>:<span style=\"color: #ae81ff;\">49</span>:<span style=\"color: #ae81ff;\">07</span> wigner systemd[<span style=\"color: #ae81ff;\">1</span>]: Started System Logger Daemon<span style=\"color: #f92672;\">.</span>[<span style=\"color: #ae81ff;\">5</span>] <span style=\"color: #ae81ff;\">14</span>:<span style=\"color: #ae81ff;\">49</span>:<span style=\"color: #ae81ff;\">34</span> [SUCCESS] szilard<span style=\"color: #960050; background-color: #1e0010;\">●</span> syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">.</span>service <span style=\"color: #f92672;\">-</span> System Logger Daemon     Loaded: loaded (<span style=\"color: #f92672;\">/</span>lib<span style=\"color: #f92672;\">/</span>systemd<span style=\"color: #f92672;\">/</span>system<span style=\"color: #f92672;\">/</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">.</span>service; enabled; vendor preset: enabled)     Active: active (running) since Thu <span style=\"color: 
#ae81ff;\">2024</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">03</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">28</span> <span style=\"color: #ae81ff;\">14</span>:<span style=\"color: #ae81ff;\">49</span>:<span style=\"color: #ae81ff;\">05</span> EDT; <span style=\"color: #ae81ff;\">26</span>s ago       Docs: man:syslog<span style=\"color: #f92672;\">-</span>ng(<span style=\"color: #ae81ff;\">8</span>)   Main PID: <span style=\"color: #ae81ff;\">30348</span> (syslog<span style=\"color: #f92672;\">-</span>ng)      Tasks: <span style=\"color: #ae81ff;\">2</span> (limit: <span style=\"color: #ae81ff;\">779</span>)        CPU: <span style=\"color: #ae81ff;\">816</span>ms     CGroup: <span style=\"color: #f92672;\">/</span>system<span style=\"color: #f92672;\">.</span>slice<span style=\"color: #f92672;\">/</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">.</span>service             <span style=\"color: #960050; background-color: #1e0010;\">└─</span><span style=\"color: #ae81ff;\">30348</span> <span style=\"color: #f92672;\">/</span>usr<span style=\"color: #f92672;\">/</span>sbin<span style=\"color: #f92672;\">/</span>syslog<span style=\"color: #f92672;\">-</span>ng <span style=\"color: #f92672;\">-</span>FMar <span style=\"color: #ae81ff;\">28</span> <span style=\"color: #ae81ff;\">14</span>:<span style=\"color: #ae81ff;\">49</span>:<span style=\"color: #ae81ff;\">03</span> szilard systemd[<span style=\"color: #ae81ff;\">1</span>]: Starting System Logger Daemon<span style=\"color: #f92672;\">...</span>Mar <span style=\"color: #ae81ff;\">28</span> <span style=\"color: #ae81ff;\">14</span>:<span style=\"color: #ae81ff;\">49</span>:<span style=\"color: #ae81ff;\">03</span> szilard syslog<span style=\"color: #f92672;\">-</span>ng[<span style=\"color: #ae81ff;\">30348</span>]: DIGEST<span style=\"color: #f92672;\">-</span>MD5 common mech freeMar <span style=\"color: 
#ae81ff;\">28</span> <span style=\"color: #ae81ff;\">14</span>:<span style=\"color: #ae81ff;\">49</span>:<span style=\"color: #ae81ff;\">05</span> szilard systemd[<span style=\"color: #ae81ff;\">1</span>]: Started System Logger Daemon<span style=\"color: #f92672;\">.</span>[<span style=\"color: #ae81ff;\">6</span>] <span style=\"color: #ae81ff;\">14</span>:<span style=\"color: #ae81ff;\">49</span>:<span style=\"color: #ae81ff;\">34</span> [SUCCESS] teller<span style=\"color: #960050; background-color: #1e0010;\">●</span> syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">.</span>service <span style=\"color: #f92672;\">-</span> System Logger Daemon     Loaded: loaded (<span style=\"color: #f92672;\">/</span>lib<span style=\"color: #f92672;\">/</span>systemd<span style=\"color: #f92672;\">/</span>system<span style=\"color: #f92672;\">/</span>syslog<span style=\"color: #f92672;\">-</span>ng<span style=\"color: #f92672;\">.</span>service; enabled; vendor preset: enabled)     Active: active (running) since Thu <span style=\"color: #ae81ff;\">2024</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">03</span><span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">28</span> <span style=\"color: #ae81ff;\">14</span>:<span style=\"color: #ae81ff;\">49</span>:<span style=\"color: #ae81ff;\">06</span> EDT; <span style=\"color: #ae81ff;\">25</span>s ago       Docs: man:syslog<span style=\"color: #f92672;\">-</span>ng(<span style=\"color: #ae81ff;\">8</span>)   Main PID: <span style=\"color: #ae81ff;\">31034</span> (syslog<span style=\"color: #f92672;\">-</span>ng)      Tasks: <span style=\"color: #ae81ff;\">2</span> (limit: <span style=\"color: #ae81ff;\">779</span>)        CPU: <span style=\"color: #ae81ff;\">965</span>ms     CGroup: <span style=\"color: #f92672;\">/</span>system<span style=\"color: #f92672;\">.</span>slice<span style=\"color: #f92672;\">/</span>syslog<span style=\"color: 
#f92672;\">-</span>ng<span style=\"color: #f92672;\">.</span>service             <span style=\"color: #960050; background-color: #1e0010;\">└─</span><span style=\"color: #ae81ff;\">31034</span> <span style=\"color: #f92672;\">/</span>usr<span style=\"color: #f92672;\">/</span>sbin<span style=\"color: #f92672;\">/</span>syslog<span style=\"color: #f92672;\">-</span>ng <span style=\"color: #f92672;\">-</span>F</code></pre></div></details><br /><!-- raw HTML omitted --><strong>Does it work?</strong></p><p>The answer to this question is an emphatic YES!</p><p>Let’s begin with a simple test running the <em>logger</em> command on all of the compute nodes, while monitoring <em>/var/log/fromnet</em> on host <em>turingpi</em>.</p><div class=\"highlight\"><pre><code class=\"language-plaintext\"> root@turingpi:/home/lsfadmin# date; parallel-ssh -h /opt/workers -i 'HOST=`hostname`; logger This is a test from node $HOST. Do not panic!' Wed  3 Apr 21:41:45 EDT 2024 [1] 21:41:46 [SUCCESS] teller [2] 21:41:46 [SUCCESS] neumann [3] 21:41:46 [SUCCESS] wigner [4] 21:41:46 [SUCCESS] kemeny [5] 21:41:46 [SUCCESS] szilard [6] 21:41:46 [SUCCESS] vonkarmanroot@turingpi:/var/log# tail -f fromnet |grep panic Apr  3 21:41:46 szilard root[10918]: This is a test from node szilard. Do not panic! Apr  3 21:41:46 wigner root[11011]: This is a test from node wigner. Do not panic! Apr  3 21:41:46 neumann root[11121]: This is a test from node neumann. Do not panic! Apr  3 21:41:46 kemeny root[11029]: This is a test from node kemeny. Do not panic! Apr  3 21:41:46 teller root[10875]: This is a test from node teller. Do not panic! Apr  3 21:41:46 vonkarman root[10805]: This is a test from node vonkarman. Do not panic!</code></pre></div><p>Next, let’s look at whether the LSF logging is also captured. Here we simply restart the LSF daemons on Nodes 2-7 and monitor the <em>/var/log/fromnet</em> file. The full output can be viewed below.</p><p><details>  <strong>Output of <em>tail -f /var/log/fromnet</em>. 
Click to expand</strong>  <div class=\"highlight\"><pre><code class=\"language-python\">root<span style=\"color: #a6e22e;\">@turingpi</span>:<span style=\"color: #f92672;\">/</span>var<span style=\"color: #f92672;\">/</span>log<span style=\"color: #75715e;\"># tail -f fromnet </span>Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">41</span>:<span style=\"color: #ae81ff;\">57</span> vonkarman systemd[<span style=\"color: #ae81ff;\">10786</span>]: systemd<span style=\"color: #f92672;\">-</span>exit<span style=\"color: #f92672;\">.</span>service: Succeeded<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">41</span>:<span style=\"color: #ae81ff;\">57</span> vonkarman systemd[<span style=\"color: #ae81ff;\">10786</span>]: Finished Exit the Session<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">41</span>:<span style=\"color: #ae81ff;\">57</span> vonkarman systemd[<span style=\"color: #ae81ff;\">10786</span>]: Reached target Exit the Session<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">41</span>:<span style=\"color: #ae81ff;\">57</span> vonkarman systemd[<span style=\"color: #ae81ff;\">1</span>]: user<span style=\"color: #f92672;\">@</span><span style=\"color: #ae81ff;\">0.</span>service: Succeeded<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">41</span>:<span style=\"color: #ae81ff;\">57</span> vonkarman systemd[<span style=\"color: #ae81ff;\">1</span>]: Stopped User Manager <span style=\"color: #66d9ef;\">for</span> UID <span style=\"color: 
#ae81ff;\">0.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">41</span>:<span style=\"color: #ae81ff;\">57</span> vonkarman systemd[<span style=\"color: #ae81ff;\">1</span>]: Stopping User Runtime Directory <span style=\"color: #f92672;\">/</span>run<span style=\"color: #f92672;\">/</span>user<span style=\"color: #f92672;\">/</span><span style=\"color: #ae81ff;\">0.</span><span style=\"color: #f92672;\">..</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">41</span>:<span style=\"color: #ae81ff;\">57</span> vonkarman systemd[<span style=\"color: #ae81ff;\">1</span>]: run<span style=\"color: #f92672;\">-</span>user<span style=\"color: #f92672;\">-</span><span style=\"color: #ae81ff;\">0.</span>mount: Succeeded<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">41</span>:<span style=\"color: #ae81ff;\">57</span> vonkarman systemd[<span style=\"color: #ae81ff;\">1</span>]: user<span style=\"color: #f92672;\">-</span>runtime<span style=\"color: #f92672;\">-</span>dir<span style=\"color: #f92672;\">@</span><span style=\"color: #ae81ff;\">0.</span>service: Succeeded<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">41</span>:<span style=\"color: #ae81ff;\">57</span> vonkarman systemd[<span style=\"color: #ae81ff;\">1</span>]: Stopped User Runtime Directory <span style=\"color: #f92672;\">/</span>run<span style=\"color: #f92672;\">/</span>user<span style=\"color: #f92672;\">/</span><span style=\"color: #ae81ff;\">0.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">41</span>:<span style=\"color: #ae81ff;\">57</span> vonkarman 
systemd[<span style=\"color: #ae81ff;\">1</span>]: Removed slice User Slice of UID <span style=\"color: #ae81ff;\">0.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">30</span> wigner dhcpcd[<span style=\"color: #ae81ff;\">493</span>]: eth0: Router Advertisement <span style=\"color: #f92672;\">from</span> fe80::da58:d7ff:fe00:<span style=\"color: #ae81ff;\">6</span>d83 Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">57</span> szilard sshd[<span style=\"color: #ae81ff;\">11234</span>]: Accepted publickey <span style=\"color: #66d9ef;\">for</span> root <span style=\"color: #f92672;\">from</span> <span style=\"color: #ae81ff;\">192.168.1.172</span> port <span style=\"color: #ae81ff;\">52600</span> ssh2: ED25519 SHA256:xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">57</span> szilard sshd[<span style=\"color: #ae81ff;\">11234</span>]: pam_unix(sshd:session): session opened <span style=\"color: #66d9ef;\">for</span> user root(uid<span style=\"color: #f92672;\">=</span><span style=\"color: #ae81ff;\">0</span>) by (uid<span style=\"color: #f92672;\">=</span><span style=\"color: #ae81ff;\">0</span>) Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> szilard systemd[<span style=\"color: #ae81ff;\">1</span>]: Created slice User Slice of UID <span style=\"color: #ae81ff;\">0.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> szilard systemd[<span 
style=\"color: #ae81ff;\">1</span>]: Starting User Runtime Directory <span style=\"color: #f92672;\">/</span>run<span style=\"color: #f92672;\">/</span>user<span style=\"color: #f92672;\">/</span><span style=\"color: #ae81ff;\">0.</span><span style=\"color: #f92672;\">..</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> szilard systemd<span style=\"color: #f92672;\">-</span>logind[<span style=\"color: #ae81ff;\">382</span>]: New session <span style=\"color: #ae81ff;\">30</span> of user root<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> szilard systemd[<span style=\"color: #ae81ff;\">1</span>]: Finished User Runtime Directory <span style=\"color: #f92672;\">/</span>run<span style=\"color: #f92672;\">/</span>user<span style=\"color: #f92672;\">/</span><span style=\"color: #ae81ff;\">0.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> szilard systemd[<span style=\"color: #ae81ff;\">1</span>]: Starting User Manager <span style=\"color: #66d9ef;\">for</span> UID <span style=\"color: #ae81ff;\">0.</span><span style=\"color: #f92672;\">..</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> szilard systemd[<span style=\"color: #ae81ff;\">11237</span>]: pam_unix(systemd<span style=\"color: #f92672;\">-</span>user:session): session opened <span style=\"color: #66d9ef;\">for</span> user root(uid<span style=\"color: #f92672;\">=</span><span style=\"color: #ae81ff;\">0</span>) by(uid<span style=\"color: #f92672;\">=</span><span 
style=\"color: #ae81ff;\">0</span>) Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">57</span> wigner sshd[<span style=\"color: #ae81ff;\">11342</span>]: Accepted publickey <span style=\"color: #66d9ef;\">for</span> root <span style=\"color: #f92672;\">from</span> <span style=\"color: #ae81ff;\">192.168.1.172</span> port <span style=\"color: #ae81ff;\">60388</span> ssh2: ED25519 SHA256:xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">57</span> wigner sshd[<span style=\"color: #ae81ff;\">11342</span>]: pam_unix(sshd:session): session opened <span style=\"color: #66d9ef;\">for</span> user root(uid<span style=\"color: #f92672;\">=</span><span style=\"color: #ae81ff;\">0</span>) by (uid<span style=\"color: #f92672;\">=</span><span style=\"color: #ae81ff;\">0</span>) Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> wigner systemd[<span style=\"color: #ae81ff;\">1</span>]: Created slice User Slice of UID <span style=\"color: #ae81ff;\">0.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> wigner systemd[<span style=\"color: #ae81ff;\">1</span>]: Starting User Runtime Directory <span style=\"color: #f92672;\">/</span>run<span style=\"color: #f92672;\">/</span>user<span style=\"color: #f92672;\">/</span><span style=\"color: #ae81ff;\">0.</span><span style=\"color: #f92672;\">..</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> wigner 
systemd<span style=\"color: #f92672;\">-</span>logind[<span style=\"color: #ae81ff;\">383</span>]: New session <span style=\"color: #ae81ff;\">30</span> of user root<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> wigner systemd[<span style=\"color: #ae81ff;\">1</span>]: Finished User Runtime Directory <span style=\"color: #f92672;\">/</span>run<span style=\"color: #f92672;\">/</span>user<span style=\"color: #f92672;\">/</span><span style=\"color: #ae81ff;\">0.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> wigner systemd[<span style=\"color: #ae81ff;\">1</span>]: Starting User Manager <span style=\"color: #66d9ef;\">for</span> UID <span style=\"color: #ae81ff;\">0.</span><span style=\"color: #f92672;\">..</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> wigner systemd[<span style=\"color: #ae81ff;\">11345</span>]: pam_unix(systemd<span style=\"color: #f92672;\">-</span>user:session): session opened <span style=\"color: #66d9ef;\">for</span> user root(uid<span style=\"color: #f92672;\">=</span><span style=\"color: #ae81ff;\">0</span>) by (uid<span style=\"color: #f92672;\">=</span><span style=\"color: #ae81ff;\">0</span>) Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">57</span> neumann sshd[<span style=\"color: #ae81ff;\">11436</span>]: Accepted publickey <span style=\"color: #66d9ef;\">for</span> root <span style=\"color: #f92672;\">from</span> <span style=\"color: #ae81ff;\">192.168.1.172</span> port <span style=\"color: 
#ae81ff;\">55144</span> ssh2: ED25519 SHA256:xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">57</span> neumann sshd[<span style=\"color: #ae81ff;\">11436</span>]: pam_unix(sshd:session): session opened <span style=\"color: #66d9ef;\">for</span> user root(uid<span style=\"color: #f92672;\">=</span><span style=\"color: #ae81ff;\">0</span>) by (uid<span style=\"color: #f92672;\">=</span><span style=\"color: #ae81ff;\">0</span>) Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">57</span> neumann systemd[<span style=\"color: #ae81ff;\">1</span>]: Created slice User Slice of UID <span style=\"color: #ae81ff;\">0.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">57</span> neumann systemd[<span style=\"color: #ae81ff;\">1</span>]: Starting User Runtime Directory <span style=\"color: #f92672;\">/</span>run<span style=\"color: #f92672;\">/</span>user<span style=\"color: #f92672;\">/</span><span style=\"color: #ae81ff;\">0.</span><span style=\"color: #f92672;\">..</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> neumann systemd<span style=\"color: #f92672;\">-</span>logind[<span style=\"color: #ae81ff;\">398</span>]: New session <span style=\"color: #ae81ff;\">30</span> of user root<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> neumann systemd[<span style=\"color: #ae81ff;\">1</span>]: Finished User 
Runtime Directory <span style=\"color: #f92672;\">/</span>run<span style=\"color: #f92672;\">/</span>user<span style=\"color: #f92672;\">/</span><span style=\"color: #ae81ff;\">0.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> neumann systemd[<span style=\"color: #ae81ff;\">1</span>]: Starting User Manager <span style=\"color: #66d9ef;\">for</span> UID <span style=\"color: #ae81ff;\">0.</span><span style=\"color: #f92672;\">..</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> neumann systemd[<span style=\"color: #ae81ff;\">11439</span>]: pam_unix(systemd<span style=\"color: #f92672;\">-</span>user:session): session opened <span style=\"color: #66d9ef;\">for</span> user root(uid<span style=\"color: #f92672;\">=</span><span style=\"color: #ae81ff;\">0</span>) by(uid<span style=\"color: #f92672;\">=</span><span style=\"color: #ae81ff;\">0</span>) Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">57</span> kemeny sshd[<span style=\"color: #ae81ff;\">11345</span>]: Accepted publickey <span style=\"color: #66d9ef;\">for</span> root <span style=\"color: #f92672;\">from</span> <span style=\"color: #ae81ff;\">192.168.1.172</span> port <span style=\"color: #ae81ff;\">59830</span> ssh2: ED25519 SHA256:xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">57</span> kemeny sshd[<span style=\"color: #ae81ff;\">11345</span>]: pam_unix(sshd:session): session opened <span style=\"color: #66d9ef;\">for</span> user root(uid<span style=\"color: #f92672;\">=</span><span 
style=\"color: #ae81ff;\">0</span>) by (uid<span style=\"color: #f92672;\">=</span><span style=\"color: #ae81ff;\">0</span>) Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> kemeny systemd[<span style=\"color: #ae81ff;\">1</span>]: Created slice User Slice of UID <span style=\"color: #ae81ff;\">0.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> kemeny systemd[<span style=\"color: #ae81ff;\">1</span>]: Starting User Runtime Directory <span style=\"color: #f92672;\">/</span>run<span style=\"color: #f92672;\">/</span>user<span style=\"color: #f92672;\">/</span><span style=\"color: #ae81ff;\">0.</span><span style=\"color: #f92672;\">..</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> kemeny systemd<span style=\"color: #f92672;\">-</span>logind[<span style=\"color: #ae81ff;\">386</span>]: New session <span style=\"color: #ae81ff;\">30</span> of user root<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> kemeny systemd[<span style=\"color: #ae81ff;\">1</span>]: Finished User Runtime Directory <span style=\"color: #f92672;\">/</span>run<span style=\"color: #f92672;\">/</span>user<span style=\"color: #f92672;\">/</span><span style=\"color: #ae81ff;\">0.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> kemeny systemd[<span style=\"color: #ae81ff;\">1</span>]: Starting User Manager <span style=\"color: 
#66d9ef;\">for</span> UID <span style=\"color: #ae81ff;\">0.</span><span style=\"color: #f92672;\">..</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> kemeny systemd[<span style=\"color: #ae81ff;\">11348</span>]: pam_unix(systemd<span style=\"color: #f92672;\">-</span>user:session): session opened <span style=\"color: #66d9ef;\">for</span> user root(uid<span style=\"color: #f92672;\">=</span><span style=\"color: #ae81ff;\">0</span>) by (uid<span style=\"color: #f92672;\">=</span><span style=\"color: #ae81ff;\">0</span>) Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">57</span> teller sshd[<span style=\"color: #ae81ff;\">11189</span>]: Accepted publickey <span style=\"color: #66d9ef;\">for</span> root <span style=\"color: #f92672;\">from</span> <span style=\"color: #ae81ff;\">192.168.1.172</span> port <span style=\"color: #ae81ff;\">35310</span> ssh2: ED25519 SHA256:xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">57</span> teller sshd[<span style=\"color: #ae81ff;\">11189</span>]: pam_unix(sshd:session): session opened <span style=\"color: #66d9ef;\">for</span> user root(uid<span style=\"color: #f92672;\">=</span><span style=\"color: #ae81ff;\">0</span>) by (uid<span style=\"color: #f92672;\">=</span><span style=\"color: #ae81ff;\">0</span>) Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> teller systemd[<span style=\"color: #ae81ff;\">1</span>]: Created slice User Slice of UID <span style=\"color: #ae81ff;\">0.</span> Apr  <span style=\"color: 
#ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> teller systemd[<span style=\"color: #ae81ff;\">1</span>]: Starting User Runtime Directory <span style=\"color: #f92672;\">/</span>run<span style=\"color: #f92672;\">/</span>user<span style=\"color: #f92672;\">/</span><span style=\"color: #ae81ff;\">0.</span><span style=\"color: #f92672;\">..</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> teller systemd<span style=\"color: #f92672;\">-</span>logind[<span style=\"color: #ae81ff;\">382</span>]: New session <span style=\"color: #ae81ff;\">30</span> of user root<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> teller systemd[<span style=\"color: #ae81ff;\">1</span>]: Finished User Runtime Directory <span style=\"color: #f92672;\">/</span>run<span style=\"color: #f92672;\">/</span>user<span style=\"color: #f92672;\">/</span><span style=\"color: #ae81ff;\">0.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> teller systemd[<span style=\"color: #ae81ff;\">1</span>]: Starting User Manager <span style=\"color: #66d9ef;\">for</span> UID <span style=\"color: #ae81ff;\">0.</span><span style=\"color: #f92672;\">..</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> teller systemd[<span style=\"color: #ae81ff;\">11192</span>]: pam_unix(systemd<span style=\"color: #f92672;\">-</span>user:session): session opened <span style=\"color: 
#66d9ef;\">for</span> user root(uid<span style=\"color: #f92672;\">=</span><span style=\"color: #ae81ff;\">0</span>) by (uid<span style=\"color: #f92672;\">=</span><span style=\"color: #ae81ff;\">0</span>) Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">57</span> vonkarman sshd[<span style=\"color: #ae81ff;\">11118</span>]: Accepted publickey <span style=\"color: #66d9ef;\">for</span> root <span style=\"color: #f92672;\">from</span> <span style=\"color: #ae81ff;\">192.168.1.172</span> port <span style=\"color: #ae81ff;\">48654</span> ssh2: ED25519 SHA256:xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> vonkarman sshd[<span style=\"color: #ae81ff;\">11118</span>]: pam_unix(sshd:session): session opened <span style=\"color: #66d9ef;\">for</span> user root(uid<span style=\"color: #f92672;\">=</span><span style=\"color: #ae81ff;\">0</span>) by (uid<span style=\"color: #f92672;\">=</span><span style=\"color: #ae81ff;\">0</span>) Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> vonkarman systemd[<span style=\"color: #ae81ff;\">1</span>]: Created slice User Slice of UID <span style=\"color: #ae81ff;\">0.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> vonkarman systemd[<span style=\"color: #ae81ff;\">1</span>]: Starting User Runtime Directory <span style=\"color: #f92672;\">/</span>run<span style=\"color: #f92672;\">/</span>user<span style=\"color: #f92672;\">/</span><span style=\"color: #ae81ff;\">0.</span><span style=\"color: #f92672;\">..</span> 
Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> vonkarman systemd<span style=\"color: #f92672;\">-</span>logind[<span style=\"color: #ae81ff;\">382</span>]: New session <span style=\"color: #ae81ff;\">29</span> of user root<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> vonkarman systemd[<span style=\"color: #ae81ff;\">1</span>]: Finished User Runtime Directory <span style=\"color: #f92672;\">/</span>run<span style=\"color: #f92672;\">/</span>user<span style=\"color: #f92672;\">/</span><span style=\"color: #ae81ff;\">0.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> vonkarman systemd[<span style=\"color: #ae81ff;\">1</span>]: Starting User Manager <span style=\"color: #66d9ef;\">for</span> UID <span style=\"color: #ae81ff;\">0.</span><span style=\"color: #f92672;\">..</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> vonkarman systemd[<span style=\"color: #ae81ff;\">11121</span>]: pam_unix(systemd<span style=\"color: #f92672;\">-</span>user:session): session opened <span style=\"color: #66d9ef;\">for</span> user root(uid<span style=\"color: #f92672;\">=</span><span style=\"color: #ae81ff;\">0</span>) by (uid<span style=\"color: #f92672;\">=</span><span style=\"color: #ae81ff;\">0</span>) Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> neumann systemd[<span style=\"color: #ae81ff;\">11439</span>]: Queued 
start job <span style=\"color: #66d9ef;\">for</span> default target Main User Target<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> neumann systemd[<span style=\"color: #ae81ff;\">11439</span>]: Created slice User Application Slice<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> neumann systemd[<span style=\"color: #ae81ff;\">11439</span>]: Reached target Paths<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> neumann systemd[<span style=\"color: #ae81ff;\">11439</span>]: Reached target Timers<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> neumann systemd[<span style=\"color: #ae81ff;\">11439</span>]: Listening on GnuPG network certificate management daemon<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> neumann systemd[<span style=\"color: #ae81ff;\">11439</span>]: Listening on GnuPG cryptographic agent <span style=\"color: #f92672;\">and</span> passphrase cache (access for web browsers)<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> neumann systemd[<span style=\"color: #ae81ff;\">11439</span>]: 
Listening on GnuPG cryptographic agent <span style=\"color: #f92672;\">and</span> passphrase cache (restricted)<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> neumann systemd[<span style=\"color: #ae81ff;\">11439</span>]: Listening on GnuPG cryptographic agent (ssh<span style=\"color: #f92672;\">-</span>agent emulation)<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> neumann systemd[<span style=\"color: #ae81ff;\">11439</span>]: Listening on GnuPG cryptographic agent <span style=\"color: #f92672;\">and</span> passphrase cache<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> neumann systemd[<span style=\"color: #ae81ff;\">11439</span>]: Reached target Sockets<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> neumann systemd[<span style=\"color: #ae81ff;\">11439</span>]: Reached target Basic System<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> neumann systemd[<span style=\"color: #ae81ff;\">11439</span>]: Reached target Main User Target<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> neumann 
systemd[<span style=\"color: #ae81ff;\">11439</span>]: Startup finished <span style=\"color: #f92672;\">in</span> <span style=\"color: #ae81ff;\">379</span>ms<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> neumann systemd[<span style=\"color: #ae81ff;\">1</span>]: Started User Manager <span style=\"color: #66d9ef;\">for</span> UID <span style=\"color: #ae81ff;\">0.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> neumann systemd[<span style=\"color: #ae81ff;\">1</span>]: Started Session <span style=\"color: #ae81ff;\">30</span> of user root<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> teller systemd[<span style=\"color: #ae81ff;\">11192</span>]: Queued start job <span style=\"color: #66d9ef;\">for</span> default target Main User Target<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> teller systemd[<span style=\"color: #ae81ff;\">11192</span>]: Created slice User Application Slice<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> teller systemd[<span style=\"color: #ae81ff;\">11192</span>]: Reached target Paths<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span 
style=\"color: #ae81ff;\">58</span> teller systemd[<span style=\"color: #ae81ff;\">11192</span>]: Reached target Timers<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> teller systemd[<span style=\"color: #ae81ff;\">11192</span>]: Listening on GnuPG network certificate management daemon<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> teller systemd[<span style=\"color: #ae81ff;\">11192</span>]: Listening on GnuPG cryptographic agent <span style=\"color: #f92672;\">and</span> passphrase cache (access <span style=\"color: #66d9ef;\">for</span> web browsers)<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> teller systemd[<span style=\"color: #ae81ff;\">11192</span>]: Listening on GnuPG cryptographic agent <span style=\"color: #f92672;\">and</span> passphrase cache (restricted)<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> teller systemd[<span style=\"color: #ae81ff;\">11192</span>]: Listening on GnuPG cryptographic agent (ssh<span style=\"color: #f92672;\">-</span>agent emulation)<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> teller systemd[<span style=\"color: #ae81ff;\">11192</span>]: Listening on GnuPG cryptographic agent <span style=\"color: 
#f92672;\">and</span> passphrase cache<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> teller systemd[<span style=\"color: #ae81ff;\">11192</span>]: Reached target Sockets<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> teller systemd[<span style=\"color: #ae81ff;\">11192</span>]: Reached target Basic System<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> teller systemd[<span style=\"color: #ae81ff;\">11192</span>]: Reached target Main User Target<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> teller systemd[<span style=\"color: #ae81ff;\">11192</span>]: Startup finished <span style=\"color: #f92672;\">in</span> <span style=\"color: #ae81ff;\">373</span>ms<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> teller systemd[<span style=\"color: #ae81ff;\">1</span>]: Started User Manager <span style=\"color: #66d9ef;\">for</span> UID <span style=\"color: #ae81ff;\">0.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> teller systemd[<span style=\"color: #ae81ff;\">1</span>]: Started Session <span style=\"color: #ae81ff;\">30</span> of user 
root<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> vonkarman systemd[<span style=\"color: #ae81ff;\">11121</span>]: Queued start job <span style=\"color: #66d9ef;\">for</span> default target Main User Target<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> vonkarman systemd[<span style=\"color: #ae81ff;\">11121</span>]: Created slice User Application Slice<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> vonkarman systemd[<span style=\"color: #ae81ff;\">11121</span>]: Reached target Paths<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> vonkarman systemd[<span style=\"color: #ae81ff;\">11121</span>]: Reached target Timers<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> vonkarman systemd[<span style=\"color: #ae81ff;\">11121</span>]: Listening on GnuPG network certificate management daemon<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> vonkarman systemd[<span style=\"color: #ae81ff;\">11121</span>]: Listening on GnuPG cryptographic agent <span style=\"color: #f92672;\">and</span> passphrase cache 
(access <span style=\"color: #66d9ef;\">for</span> web browsers)<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> vonkarman systemd[<span style=\"color: #ae81ff;\">11121</span>]: Listening on GnuPG cryptographic agent <span style=\"color: #f92672;\">and</span> passphrase cache (restricted)<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> vonkarman systemd[<span style=\"color: #ae81ff;\">11121</span>]: Listening on GnuPG cryptographic agent (ssh<span style=\"color: #f92672;\">-</span>agent emulation)<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> vonkarman systemd[<span style=\"color: #ae81ff;\">11121</span>]: Listening on GnuPG cryptographic agent <span style=\"color: #f92672;\">and</span> passphrase cache<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> vonkarman systemd[<span style=\"color: #ae81ff;\">11121</span>]: Reached target Sockets<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> vonkarman systemd[<span style=\"color: #ae81ff;\">11121</span>]: Reached target Basic System<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span 
style=\"color: #ae81ff;\">58</span> vonkarman systemd[<span style=\"color: #ae81ff;\">11121</span>]: Reached target Main User Target<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> vonkarman systemd[<span style=\"color: #ae81ff;\">11121</span>]: Startup finished <span style=\"color: #f92672;\">in</span> <span style=\"color: #ae81ff;\">392</span>ms<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> vonkarman systemd[<span style=\"color: #ae81ff;\">1</span>]: Started User Manager <span style=\"color: #66d9ef;\">for</span> UID <span style=\"color: #ae81ff;\">0.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> vonkarman systemd[<span style=\"color: #ae81ff;\">1</span>]: Started Session <span style=\"color: #ae81ff;\">29</span> of user root<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> szilard systemd[<span style=\"color: #ae81ff;\">11237</span>]: Queued start job <span style=\"color: #66d9ef;\">for</span> default target Main User Target<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> szilard systemd[<span style=\"color: #ae81ff;\">11237</span>]: Created slice User Application Slice<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: 
#ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> szilard systemd[<span style=\"color: #ae81ff;\">11237</span>]: Reached target Paths<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> szilard systemd[<span style=\"color: #ae81ff;\">11237</span>]: Reached target Timers<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> szilard systemd[<span style=\"color: #ae81ff;\">11237</span>]: Listening on GnuPG network certificate management daemon<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> szilard systemd[<span style=\"color: #ae81ff;\">11237</span>]: Listening on GnuPG cryptographic agent <span style=\"color: #f92672;\">and</span> passphrase cache (access for web browsers)<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> szilard systemd[<span style=\"color: #ae81ff;\">11237</span>]: Listening on GnuPG cryptographic agent <span style=\"color: #f92672;\">and</span> passphrase cache (restricted)<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> szilard systemd[<span style=\"color: #ae81ff;\">11237</span>]: Listening on GnuPG cryptographic agent (ssh<span style=\"color: #f92672;\">-</span>agent emulation)<span style=\"color: 
#f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> szilard systemd[<span style=\"color: #ae81ff;\">11237</span>]: Listening on GnuPG cryptographic agent <span style=\"color: #f92672;\">and</span> passphrase cache<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> szilard systemd[<span style=\"color: #ae81ff;\">11237</span>]: Reached target Sockets<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> szilard systemd[<span style=\"color: #ae81ff;\">11237</span>]: Reached target Basic System<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> szilard systemd[<span style=\"color: #ae81ff;\">11237</span>]: Reached target Main User Target<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> szilard systemd[<span style=\"color: #ae81ff;\">11237</span>]: Startup finished <span style=\"color: #f92672;\">in</span> <span style=\"color: #ae81ff;\">385</span>ms<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> szilard systemd[<span style=\"color: #ae81ff;\">1</span>]: Started User Manager <span style=\"color: #66d9ef;\">for</span> UID <span 
style=\"color: #ae81ff;\">0.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> szilard systemd[<span style=\"color: #ae81ff;\">1</span>]: Started Session <span style=\"color: #ae81ff;\">30</span> of user root<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> wigner systemd[<span style=\"color: #ae81ff;\">11345</span>]: Queued start job <span style=\"color: #66d9ef;\">for</span> default target Main User Target<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> wigner systemd[<span style=\"color: #ae81ff;\">11345</span>]: Created slice User Application Slice<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> wigner systemd[<span style=\"color: #ae81ff;\">11345</span>]: Reached target Paths<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> wigner systemd[<span style=\"color: #ae81ff;\">11345</span>]: Reached target Timers<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> wigner systemd[<span style=\"color: #ae81ff;\">11345</span>]: Listening on GnuPG network certificate management daemon<span style=\"color: #f92672;\">.</span> Apr  <span 
style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> wigner systemd[<span style=\"color: #ae81ff;\">11345</span>]: Listening on GnuPG cryptographic agent <span style=\"color: #f92672;\">and</span> passphrase cache (access <span style=\"color: #66d9ef;\">for</span> web browsers)<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> wigner systemd[<span style=\"color: #ae81ff;\">11345</span>]: Listening on GnuPG cryptographic agent <span style=\"color: #f92672;\">and</span> passphrase cache (restricted)<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> wigner systemd[<span style=\"color: #ae81ff;\">11345</span>]: Listening on GnuPG cryptographic agent (ssh<span style=\"color: #f92672;\">-</span>agent emulation)<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> wigner systemd[<span style=\"color: #ae81ff;\">11345</span>]: Listening on GnuPG cryptographic agent <span style=\"color: #f92672;\">and</span> passphrase cache<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> wigner systemd[<span style=\"color: #ae81ff;\">11345</span>]: Reached target Sockets<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span 
style=\"color: #ae81ff;\">58</span> wigner systemd[<span style=\"color: #ae81ff;\">11345</span>]: Reached target Basic System<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> wigner systemd[<span style=\"color: #ae81ff;\">11345</span>]: Reached target Main User Target<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> wigner systemd[<span style=\"color: #ae81ff;\">11345</span>]: Startup finished <span style=\"color: #f92672;\">in</span> <span style=\"color: #ae81ff;\">375</span>ms<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> wigner systemd[<span style=\"color: #ae81ff;\">1</span>]: Started User Manager <span style=\"color: #66d9ef;\">for</span> UID <span style=\"color: #ae81ff;\">0.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> wigner systemd[<span style=\"color: #ae81ff;\">1</span>]: Started Session <span style=\"color: #ae81ff;\">30</span> of user root<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> kemeny systemd[<span style=\"color: #ae81ff;\">11348</span>]: Queued start job <span style=\"color: #66d9ef;\">for</span> default target Main User Target<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span 
style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> kemeny systemd[<span style=\"color: #ae81ff;\">11348</span>]: Created slice User Application Slice<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> kemeny systemd[<span style=\"color: #ae81ff;\">11348</span>]: Reached target Paths<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> kemeny systemd[<span style=\"color: #ae81ff;\">11348</span>]: Reached target Timers<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> kemeny systemd[<span style=\"color: #ae81ff;\">11348</span>]: Listening on GnuPG network certificate management daemon<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> kemeny systemd[<span style=\"color: #ae81ff;\">11348</span>]: Listening on GnuPG cryptographic agent <span style=\"color: #f92672;\">and</span> passphrase cache (access <span style=\"color: #66d9ef;\">for</span>web browsers)<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> kemeny systemd[<span style=\"color: #ae81ff;\">11348</span>]: Listening on GnuPG cryptographic agent <span style=\"color: #f92672;\">and</span> passphrase cache (restricted)<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: 
#ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> kemeny systemd[<span style=\"color: #ae81ff;\">11348</span>]: Listening on GnuPG cryptographic agent (ssh<span style=\"color: #f92672;\">-</span>agent emulation)<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> kemeny systemd[<span style=\"color: #ae81ff;\">11348</span>]: Listening on GnuPG cryptographic agent <span style=\"color: #f92672;\">and</span> passphrase cache<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> kemeny systemd[<span style=\"color: #ae81ff;\">11348</span>]: Reached target Sockets<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> kemeny systemd[<span style=\"color: #ae81ff;\">11348</span>]: Reached target Basic System<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> kemeny systemd[<span style=\"color: #ae81ff;\">11348</span>]: Reached target Main User Target<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> kemeny systemd[<span style=\"color: #ae81ff;\">11348</span>]: Startup finished <span style=\"color: #f92672;\">in</span> <span style=\"color: #ae81ff;\">400</span>ms<span style=\"color: 
#f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> kemeny systemd[<span style=\"color: #ae81ff;\">1</span>]: Started User Manager <span style=\"color: #66d9ef;\">for</span> UID <span style=\"color: #ae81ff;\">0.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">58</span> kemeny systemd[<span style=\"color: #ae81ff;\">1</span>]: Started Session <span style=\"color: #ae81ff;\">30</span> of user root<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> kemeny res[<span style=\"color: #ae81ff;\">691</span>]: term_handler: Received signal <span style=\"color: #ae81ff;\">15</span>, exiting Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> kemeny lim[<span style=\"color: #ae81ff;\">688</span>]: term_handler: Received signal <span style=\"color: #ae81ff;\">15</span>, exiting Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> kemeny sbatchd[<span style=\"color: #ae81ff;\">693</span>]: Daemon on host <span style=\"color: #f92672;\">&lt;</span>kemeny<span style=\"color: #f92672;\">&gt;</span> received signal <span style=\"color: #f92672;\">&lt;</span><span style=\"color: #ae81ff;\">15</span><span style=\"color: #f92672;\">&gt;</span>; exiting Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> kemeny 
lsf_daemons[<span style=\"color: #ae81ff;\">11434</span>]: Stopping the LSF subsystem Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> kemeny systemd[<span style=\"color: #ae81ff;\">1</span>]: lsfd<span style=\"color: #f92672;\">.</span>service: Succeeded<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> kemeny systemd[<span style=\"color: #ae81ff;\">1</span>]: lsfd<span style=\"color: #f92672;\">.</span>service: Consumed <span style=\"color: #ae81ff;\">11</span>min <span style=\"color: #ae81ff;\">56.744</span>s CPU time<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> szilard lim[<span style=\"color: #ae81ff;\">685</span>]: term_handler: Received signal <span style=\"color: #ae81ff;\">15</span>, exiting Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> szilard res[<span style=\"color: #ae81ff;\">687</span>]: term_handler: Received signal <span style=\"color: #ae81ff;\">15</span>, exiting Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> szilard sbatchd[<span style=\"color: #ae81ff;\">689</span>]: Daemon on host <span style=\"color: #f92672;\">&lt;</span>szilard<span style=\"color: #f92672;\">&gt;</span> received signal <span style=\"color: #f92672;\">&lt;</span><span style=\"color: #ae81ff;\">15</span><span style=\"color: #f92672;\">&gt;</span>; exiting Apr  <span style=\"color: 
#ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> vonkarman lim[<span style=\"color: #ae81ff;\">686</span>]: term_handler: Received signal <span style=\"color: #ae81ff;\">15</span>, exiting Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> vonkarman sbatchd[<span style=\"color: #ae81ff;\">690</span>]: Daemon on host <span style=\"color: #f92672;\">&lt;</span>vonkarman<span style=\"color: #f92672;\">&gt;</span> received signal <span style=\"color: #f92672;\">&lt;</span><span style=\"color: #ae81ff;\">15</span><span style=\"color: #f92672;\">&gt;</span>; exiting Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> vonkarman res[<span style=\"color: #ae81ff;\">688</span>]: term_handler: Received signal <span style=\"color: #ae81ff;\">15</span>, exiting Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> teller lim[<span style=\"color: #ae81ff;\">683</span>]: term_handler: Received signal <span style=\"color: #ae81ff;\">15</span>, exiting Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> teller res[<span style=\"color: #ae81ff;\">689</span>]: term_handler: Received signal <span style=\"color: #ae81ff;\">15</span>, exiting Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> teller sbatchd[<span style=\"color: #ae81ff;\">691</span>]: Daemon on host <span style=\"color: 
#f92672;\">&lt;</span>teller<span style=\"color: #f92672;\">&gt;</span> received signal <span style=\"color: #f92672;\">&lt;</span><span style=\"color: #ae81ff;\">15</span><span style=\"color: #f92672;\">&gt;</span>; exiting Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> teller lsf_daemons[<span style=\"color: #ae81ff;\">11294</span>]: Stopping the LSF subsystem Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> wigner lim[<span style=\"color: #ae81ff;\">719</span>]: term_handler: Received signal <span style=\"color: #ae81ff;\">15</span>, exiting Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> wigner res[<span style=\"color: #ae81ff;\">722</span>]: term_handler: Received signal <span style=\"color: #ae81ff;\">15</span>, exiting Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> wigner sbatchd[<span style=\"color: #ae81ff;\">724</span>]: Daemon on host <span style=\"color: #f92672;\">&lt;</span>wigner<span style=\"color: #f92672;\">&gt;</span> received signal <span style=\"color: #f92672;\">&lt;</span><span style=\"color: #ae81ff;\">15</span><span style=\"color: #f92672;\">&gt;</span>; exiting Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> wigner lsf_daemons[<span style=\"color: #ae81ff;\">11438</span>]: Stopping the LSF subsystem Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: 
#ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> neumann res[<span style=\"color: #ae81ff;\">713</span>]: term_handler: Received signal <span style=\"color: #ae81ff;\">15</span>, exiting Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> neumann sbatchd[<span style=\"color: #ae81ff;\">715</span>]: Daemon on host <span style=\"color: #f92672;\">&lt;</span>neumann<span style=\"color: #f92672;\">&gt;</span> received signal <span style=\"color: #f92672;\">&lt;</span><span style=\"color: #ae81ff;\">15</span><span style=\"color: #f92672;\">&gt;</span>; exiting Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> neumann lim[<span style=\"color: #ae81ff;\">711</span>]: term_handler: Received signal <span style=\"color: #ae81ff;\">15</span>, exiting Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> neumann lsf_daemons[<span style=\"color: #ae81ff;\">11540</span>]: Stopping the LSF subsystem Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> neumann sshd[<span style=\"color: #ae81ff;\">11436</span>]: Received disconnect <span style=\"color: #f92672;\">from</span> <span style=\"color: #ae81ff;\">192.168.1.172</span> port <span style=\"color: #ae81ff;\">55144</span>:<span style=\"color: #ae81ff;\">11</span>: disconnected by user Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> neumann sshd[<span style=\"color: #ae81ff;\">11436</span>]: Disconnected <span 
style=\"color: #f92672;\">from</span> user root <span style=\"color: #ae81ff;\">192.168.1.172</span> port <span style=\"color: #ae81ff;\">55144</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> szilard lsf_daemons[<span style=\"color: #ae81ff;\">11331</span>]: Stopping the LSF subsystem Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> szilard sshd[<span style=\"color: #ae81ff;\">11234</span>]: Received disconnect <span style=\"color: #f92672;\">from</span> <span style=\"color: #ae81ff;\">192.168.1.172</span> port <span style=\"color: #ae81ff;\">52600</span>:<span style=\"color: #ae81ff;\">11</span>: disconnected by user Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> szilard sshd[<span style=\"color: #ae81ff;\">11234</span>]: Disconnected <span style=\"color: #f92672;\">from</span> user root <span style=\"color: #ae81ff;\">192.168.1.172</span> port <span style=\"color: #ae81ff;\">52600</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> szilard sshd[<span style=\"color: #ae81ff;\">11234</span>]: pam_unix(sshd:session): session closed <span style=\"color: #66d9ef;\">for</span> user root Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> szilard res[<span style=\"color: #ae81ff;\">11357</span>]: res<span style=\"color: #f92672;\">/</span>get_hostInfo: ls_gethostinfo() failed<span style=\"color: #f92672;\">.</span> Server host LIM configuration is <span 
style=\"color: #f92672;\">not</span> ready yet<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> szilard systemd<span style=\"color: #f92672;\">-</span>logind[<span style=\"color: #ae81ff;\">382</span>]: Session <span style=\"color: #ae81ff;\">30</span> logged out<span style=\"color: #f92672;\">.</span> Waiting <span style=\"color: #66d9ef;\">for</span> processes to exit<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> szilard res[<span style=\"color: #ae81ff;\">11357</span>]: cg_load_hierarchies: Please use the LSF package <span style=\"color: #66d9ef;\">with</span> higher glibc version to enable LSF cgroup v2 support<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> szilard systemd[<span style=\"color: #ae81ff;\">1</span>]: lsfd<span style=\"color: #f92672;\">.</span>service: Succeeded<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> szilard systemd[<span style=\"color: #ae81ff;\">1</span>]: lsfd<span style=\"color: #f92672;\">.</span>service: Consumed <span style=\"color: #ae81ff;\">1</span>h <span style=\"color: #ae81ff;\">17</span>min <span style=\"color: #ae81ff;\">44.040</span>s CPU time<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> neumann sshd[<span 
style=\"color: #ae81ff;\">11436</span>]: pam_unix(sshd:session): session closed <span style=\"color: #66d9ef;\">for</span> user root Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> neumann systemd<span style=\"color: #f92672;\">-</span>logind[<span style=\"color: #ae81ff;\">398</span>]: Session <span style=\"color: #ae81ff;\">30</span> logged out<span style=\"color: #f92672;\">.</span> Waiting <span style=\"color: #66d9ef;\">for</span> processes to exit<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> neumann res[<span style=\"color: #ae81ff;\">11559</span>]: res<span style=\"color: #f92672;\">/</span>get_hostInfo: ls_gethostinfo() failed<span style=\"color: #f92672;\">.</span> Server host LIM configuration is <span style=\"color: #f92672;\">not</span> ready yet<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> neumann res[<span style=\"color: #ae81ff;\">11559</span>]: cg_load_hierarchies: Please use the LSF package <span style=\"color: #66d9ef;\">with</span> higher glibc version to enable LSF cgroup v2 support<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> neumann systemd[<span style=\"color: #ae81ff;\">1</span>]: lsfd<span style=\"color: #f92672;\">.</span>service: Succeeded<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span 
style=\"color: #ae81ff;\">59</span> neumann systemd[<span style=\"color: #ae81ff;\">1</span>]: lsfd<span style=\"color: #f92672;\">.</span>service: Consumed <span style=\"color: #ae81ff;\">1</span>h <span style=\"color: #ae81ff;\">17</span>min <span style=\"color: #ae81ff;\">21.135</span>s CPU time<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> teller sshd[<span style=\"color: #ae81ff;\">11189</span>]: Received disconnect <span style=\"color: #f92672;\">from</span> <span style=\"color: #ae81ff;\">192.168.1.172</span> port <span style=\"color: #ae81ff;\">35310</span>:<span style=\"color: #ae81ff;\">11</span>: disconnected by user Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> teller sshd[<span style=\"color: #ae81ff;\">11189</span>]: Disconnected <span style=\"color: #f92672;\">from</span> user root <span style=\"color: #ae81ff;\">192.168.1.172</span> port <span style=\"color: #ae81ff;\">35310</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> teller sshd[<span style=\"color: #ae81ff;\">11189</span>]: pam_unix(sshd:session): session closed <span style=\"color: #66d9ef;\">for</span> user root Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> teller systemd<span style=\"color: #f92672;\">-</span>logind[<span style=\"color: #ae81ff;\">382</span>]: Session <span style=\"color: #ae81ff;\">30</span> logged out<span style=\"color: #f92672;\">.</span> Waiting <span style=\"color: #66d9ef;\">for</span> processes to exit<span style=\"color: 
#f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> teller res[<span style=\"color: #ae81ff;\">11307</span>]: res<span style=\"color: #f92672;\">/</span>get_hostInfo: ls_gethostinfo() failed<span style=\"color: #f92672;\">.</span> Server host LIM configuration <span style=\"color: #f92672;\">is</span><span style=\"color: #f92672;\">not</span> ready yet<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> teller res[<span style=\"color: #ae81ff;\">11307</span>]: cg_load_hierarchies: Please use the LSF package <span style=\"color: #66d9ef;\">with</span> higher glibc version to enable LSF cgroup v2 support<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> teller res[<span style=\"color: #ae81ff;\">11307</span>]: term_handler: Received signal <span style=\"color: #ae81ff;\">15</span>, exiting Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> teller lim[<span style=\"color: #ae81ff;\">11305</span>]: term_handler: Received signal <span style=\"color: #ae81ff;\">15</span>, exiting Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> teller systemd[<span style=\"color: #ae81ff;\">1</span>]: lsfd<span style=\"color: #f92672;\">.</span>service: Succeeded<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span 
style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> teller systemd[<span style=\"color: #ae81ff;\">1</span>]: lsfd<span style=\"color: #f92672;\">.</span>service: Consumed <span style=\"color: #ae81ff;\">1</span>h <span style=\"color: #ae81ff;\">17</span>min <span style=\"color: #ae81ff;\">47.675</span>s CPU time<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> teller sbatchd[<span style=\"color: #ae81ff;\">11309</span>]: cg_load_hierarchies: Please use the LSF package <span style=\"color: #66d9ef;\">with</span> higher glibc version to enable LSF cgroup v2 support<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> kemeny sshd[<span style=\"color: #ae81ff;\">11345</span>]: Received disconnect <span style=\"color: #f92672;\">from</span> <span style=\"color: #ae81ff;\">192.168.1.172</span> port <span style=\"color: #ae81ff;\">59830</span>:<span style=\"color: #ae81ff;\">11</span>: disconnected by user Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> kemeny sshd[<span style=\"color: #ae81ff;\">11345</span>]: Disconnected <span style=\"color: #f92672;\">from</span> user root <span style=\"color: #ae81ff;\">192.168.1.172</span> port <span style=\"color: #ae81ff;\">59830</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> kemeny sshd[<span style=\"color: #ae81ff;\">11345</span>]: pam_unix(sshd:session): session closed <span style=\"color: #66d9ef;\">for</span> user root Apr  <span 
style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> kemeny systemd<span style=\"color: #f92672;\">-</span>logind[<span style=\"color: #ae81ff;\">386</span>]: Session <span style=\"color: #ae81ff;\">30</span> logged out<span style=\"color: #f92672;\">.</span> Waiting <span style=\"color: #66d9ef;\">for</span> processes to exit<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> kemeny res[<span style=\"color: #ae81ff;\">11467</span>]: res<span style=\"color: #f92672;\">/</span>get_hostInfo: ls_gethostinfo() failed<span style=\"color: #f92672;\">.</span> Server host LIM configuration <span style=\"color: #f92672;\">is</span><span style=\"color: #f92672;\">not</span> ready yet<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> kemeny res[<span style=\"color: #ae81ff;\">11467</span>]: cg_load_hierarchies: Please use the LSF package <span style=\"color: #66d9ef;\">with</span> higher glibc version to enable LSF cgroup v2 support<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> vonkarman lsf_daemons[<span style=\"color: #ae81ff;\">11215</span>]: Stopping the LSF subsystem Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> vonkarman sshd[<span style=\"color: #ae81ff;\">11118</span>]: Received disconnect <span style=\"color: #f92672;\">from</span> <span style=\"color: 
#ae81ff;\">192.168.1.172</span> port <span style=\"color: #ae81ff;\">48654</span>:<span style=\"color: #ae81ff;\">11</span>: disconnected by user Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> vonkarman sshd[<span style=\"color: #ae81ff;\">11118</span>]: Disconnected <span style=\"color: #f92672;\">from</span> user root <span style=\"color: #ae81ff;\">192.168.1.172</span> port <span style=\"color: #ae81ff;\">48654</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> vonkarman sshd[<span style=\"color: #ae81ff;\">11118</span>]: pam_unix(sshd:session): session closed <span style=\"color: #66d9ef;\">for</span> user root Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> vonkarman systemd<span style=\"color: #f92672;\">-</span>logind[<span style=\"color: #ae81ff;\">382</span>]: Session <span style=\"color: #ae81ff;\">29</span> logged out<span style=\"color: #f92672;\">.</span> Waiting <span style=\"color: #66d9ef;\">for</span> processes to exit<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> vonkarman res[<span style=\"color: #ae81ff;\">11241</span>]: res<span style=\"color: #f92672;\">/</span>get_hostInfo: ls_gethostinfo() failed<span style=\"color: #f92672;\">.</span> Server host LIM configuration<span style=\"color: #f92672;\">is</span> <span style=\"color: #f92672;\">not</span> ready yet<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: 
#ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> vonkarman res[<span style=\"color: #ae81ff;\">11241</span>]: cg_load_hierarchies: Please use the LSF package <span style=\"color: #66d9ef;\">with</span> higher glibc version to enable LSF cgroup v2 support<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> vonkarman systemd[<span style=\"color: #ae81ff;\">1</span>]: lsfd<span style=\"color: #f92672;\">.</span>service: Succeeded<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> vonkarman systemd[<span style=\"color: #ae81ff;\">1</span>]: lsfd<span style=\"color: #f92672;\">.</span>service: Consumed <span style=\"color: #ae81ff;\">1</span>h <span style=\"color: #ae81ff;\">17</span>min <span style=\"color: #ae81ff;\">34.650</span>s CPU time<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> wigner sshd[<span style=\"color: #ae81ff;\">11342</span>]: Received disconnect <span style=\"color: #f92672;\">from</span> <span style=\"color: #ae81ff;\">192.168.1.172</span> port <span style=\"color: #ae81ff;\">60388</span>:<span style=\"color: #ae81ff;\">11</span>: disconnected by user Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> wigner sshd[<span style=\"color: #ae81ff;\">11342</span>]: Disconnected <span style=\"color: #f92672;\">from</span> user root <span style=\"color: #ae81ff;\">192.168.1.172</span> port <span style=\"color: #ae81ff;\">60388</span> Apr  <span 
style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> wigner sshd[<span style=\"color: #ae81ff;\">11342</span>]: pam_unix(sshd:session): session closed <span style=\"color: #66d9ef;\">for</span> user root Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> wigner res[<span style=\"color: #ae81ff;\">11464</span>]: res<span style=\"color: #f92672;\">/</span>get_hostInfo: ls_gethostinfo() failed<span style=\"color: #f92672;\">.</span> Server host LIM configuration <span style=\"color: #f92672;\">is</span><span style=\"color: #f92672;\">not</span> ready yet<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> wigner systemd<span style=\"color: #f92672;\">-</span>logind[<span style=\"color: #ae81ff;\">383</span>]: Session <span style=\"color: #ae81ff;\">30</span> logged out<span style=\"color: #f92672;\">.</span> Waiting <span style=\"color: #66d9ef;\">for</span> processes to exit<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> wigner res[<span style=\"color: #ae81ff;\">11464</span>]: cg_load_hierarchies: Please use the LSF package <span style=\"color: #66d9ef;\">with</span> higher glibc version to enable LSF cgroup v2 support<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> wigner systemd[<span style=\"color: #ae81ff;\">1</span>]: lsfd<span style=\"color: 
.</span>service: Succeeded<span">
#f92672;\">.</span>service: Succeeded<span style=\"color: #f92672;\">.</span> Apr  <span style=\"color: #ae81ff;\">3</span> <span style=\"color: #ae81ff;\">21</span>:<span style=\"color: #ae81ff;\">44</span>:<span style=\"color: #ae81ff;\">59</span> wigner systemd[<span style=\"color: #ae81ff;\">1</span>]: lsfd<span style=\"color: #f92672;\">.</span>service: Consumed <span style=\"color: #ae81ff;\">1</span>h <span style=\"color: #ae81ff;\">17</span>min <span style=\"color: #ae81ff;\">44.610</span>s CPU time<span style=\"color: #f92672;\">.</span></code></pre></div></details><br /><!-- raw HTML omitted -->As expected, the LSF log messages are written to the fromnet file. Importantly, each entry contains the hostname, so we can identify the origin of each message.</p><p><strong>Conclusion</strong></p><p>What started out as a chat about logging grew into the idea for a blog post, and I am thankful to Peter for his collaboration. We’ve illustrated how to set up centralized logging on a Turing Pi system, using syslog-ng to collect system and LSF logs.</p><p>Of course, collecting log messages centrally is just the start of the journey. It is an important step because it makes debugging and troubleshooting significantly easier. You can store logs in databases for easier searching. And once you better understand which log messages are important, you can parse them and generate alerts or dashboards from them. All of this helps you make sure that your HPC system runs smoothly and with minimal downtime. For me this was a learning experience, and I&rsquo;ll be looking at how I can implement centralized logging more broadly in my home network.</p>",
            "url": "https://hpc.social/personal-blog/2024/centralized-system-and-lsf-logging-on-a-turing-pi-system/",
            
            
            
            
            
            "date_published": "2024-04-05T12:34:38-06:00",
            "date_modified": "2024-04-05T12:34:38-06:00",
            
                "author": "Ramblings of a supercomputing enthusiast."
            
        }
    
    ]
}
