This is a crosspost from Blogs on Technical Computing Goulash. See the original post here.

Visualizing Spectrum LSF data with Grafana

Overview

System monitoring is a fundamental part of IT best practices, and high performance computing (HPC) environments are no exception. At the high end, HPC clusters can consist of thousands of servers processing millions of jobs per day. HPC admins need ways to monitor the overall cluster, from system status and availability through to the efficiency of workloads. Servers today produce a wide array of metrics that can be monitored to check for various conditions, and workload schedulers produce a wealth of data about jobs. Having a single dashboard to show this type of detail can be of great benefit.

IBM Spectrum LSF Suites provide a complete solution for HPC workload management. Spectrum LSF Suite features an integrated web interface for job management and reporting, with a number of reports available out of the box and the ability to customize and add new ones. The reporting capability in Spectrum LSF Suite and IBM Spectrum LSF Explorer is underpinned by Elasticsearch, which is used to store, index and query data. With LSF data in Elasticsearch, it’s also possible to configure LSF command-line interface (CLI) tools to query information from Elasticsearch rather than flat files, for greater performance. This is controlled via the LSF_QUERY_ES_FUNCTIONS parameter of Spectrum LSF. More details about LSF_QUERY_ES_FUNCTIONS can be found in the LSF documentation here.

(1) Here is a look at the indices that are created by LSF in Elasticsearch. Note that the status shows as yellow because I only have a single Elasticsearch node, so the replica shards cannot be assigned.

# curl -XGET localhost:9200/_cat/indices
yellow open lsf_events-202205             tejh7jsMSwSeQUJzYM7cww 5 1    1137     0 808.1kb 808.1kb
yellow open lsf_jobs_pendingreason-202204 4wi7Ta8uQPSXlFBqPh4kOQ 5 1   90531     0   8.6mb   8.6mb
yellow open lsf_events-202204             tWYvW_w8TVyU1deRFOEoZg 5 1  116957 32691  59.1mb  59.1mb
yellow open lsf_jobs_active-202212        Q0pStQxvTgaeL7R-f02XWA 5 1  210052     0  50.6mb  50.6mb
yellow open lsf_jobs_pendingreason-202206 ENWIwfGrSqCHvi53aUQXJQ 5 1   44991     0   4.5mb   4.5mb
yellow open host_booleanres_latest        RE8thZCgTGeMBGodeMfXEQ 5 1       5     0  23.3kb  23.3kb
yellow open lsf_jobs_pendingreason-202205 yo0iZH_4TvOqq6kQgBluvA 5 1     111     0 181.4kb 181.4kb
yellow open lsf_jobs_pend-202212          9ViIS3nDRFewrqtILEbKTQ 5 1     707     0 446.9kb 446.9kb
yellow open lsf_hostconf_latest           9N1Y8ML4TiyaamCPEDRQog 5 1       2     0  10.6kb  10.6kb
yellow open lsf_events-202209             rtKQ8F4bSleHl8EbAQez8A 5 1    8200   955   4.4mb   4.4mb
yellow open lsf_events-202206             UUKPWfN7SZ-dzVs5NAkjUg 5 1   79503 23452  36.8mb  36.8mb
yellow open lsf_hostmetrics-202209        7FUNFCWPQtuGyx5jTJLb1A 5 1    4701     0   2.2mb   2.2mb
yellow open lsf_hostmetrics-202208        52xef_3hQWK-jVuJqyUpHA 5 1    3823     0   1.9mb   1.9mb
yellow open lsf_hostmetrics-202207        IqZYhU0RQNGIFWSRH-Ym8Q 5 1    6316     0   2.9mb   2.9mb
yellow open lsf_job_acct-202209           h1ZgCSB8RwCBxwIUUzDHEQ 5 1    2050   438   1.9mb   1.9mb
yellow open lsf_jobs_active-202209        iBfnf07CTcS7Gb6TxwomRA 5 1    2658     0     1mb     1mb
yellow open lsf_hostmetrics-202206        0PXSYBOgTA2Qa_zzaafUPg 5 1    4301     0   2.1mb   2.1mb
yellow open model                         xSqB_T_VSByOzYavEcEVyQ 1 1      55     0   257kb   257kb
yellow open lsf_job_acct-202206           C639GnzBSjCEVczfh5u23g 5 1   16719   353   8.9mb   8.9mb
yellow open lsf_jobs_active-202204        8gN_ENkQRTSfnmxrtMcOlA 5 1   33286     0   9.8mb   9.8mb
yellow open lsf_job_acct-202205           LOxmhm_8RxaCuTd7YWYbLw 5 1     274     0 439.4kb 439.4kb
yellow open lsf_jobs_active-202205        61u2RlXgR_SXagmZfrmttQ 5 1    1880     0   1.1mb   1.1mb
yellow open lsf_jobs_pend-202209          eTgqPp9nQOScNiwyUWXmHA 5 1       9     0 106.2kb 106.2kb
yellow open lsf_job_acct-202204           dDDegS6RQSWtWN99eklexg 5 1   28902  2177  17.4mb  17.4mb
yellow open lsf_jobs_active-202206        8ivkjWSNR1Sh_BxWACP0ZA 5 1   16921     0   4.6mb   4.6mb
yellow open lsf_current_status            92KE3V4YSJ-RtRp_kepxYg 5 1  115450     0     9mb     9mb
yellow open lsf_hostmetrics-202210        vbuK2wW3RRmXuY07tDPUNQ 5 1     785     0 942.1kb 942.1kb
yellow open lsf_jobs_pend-202206          OhSwn-b0SiSj8mCW5tcNIA 5 1      22     0 244.6kb 244.6kb
yellow open lsf_jobs_pend-202205          OfBtWklETYK9cRx000aNPw 5 1       1     0  12.7kb  12.7kb
yellow open lsf_events-202212             WUC5KJWmS-2WIN8XCQpSuw 5 1  712399 74728   337mb   337mb
yellow open lsf_jobs_pend-202204          OhUsXqohSciZTPZlTryMyA 5 1      50     0 275.3kb 275.3kb
yellow open resource_attributes_latest    R9bk_WIPTU62dVg3O1LDBA 5 1       5     0  24.4kb  24.4kb
yellow open lsf_jobs_pendingreason-202212 55iwDC5mRI-eRbzQLwWP6Q 5 1 3314828     0 288.7mb 288.7mb
yellow open pa-lite-log                   o8-jaNoGTsSVcjJW5Ufs0w 5 1    1549     0 547.2kb 547.2kb
yellow open lsf_job_acct-202212           4HXvAD02Sxq0tgp2fS2cfQ 5 1  161502     0  73.6mb  73.6mb
yellow open lsf_hostmetrics-202212        Tki6OJ41R363u9Tx02N4zw 5 1    2548     0   1.7mb   1.7mb
yellow open lsf_jobs_pendingreason-202209 D3TOZY2ORiK9PppGVt10Fg 5 1    2511     0 381.4kb 381.4kb
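
To see at a glance which index families hold the most documents, the listing above can be summarized per family. The following is a minimal sketch, assuming the default `_cat/indices` column order (health, status, index, uuid, pri, rep, docs.count, and so on) and that monthly indices carry a `-YYYYMM` suffix as in the output above. The function name `summarize_indices` is hypothetical, and the sample text is a three-line excerpt of the listing.

```python
import re
from collections import defaultdict

# Matches the "-YYYYMM" suffix carried by the monthly LSF indices.
MONTH_SUFFIX = re.compile(r"-\d{6}$")


def summarize_indices(cat_output):
    """Sum document counts per index family from `_cat/indices` text output.

    Assumes the default column order:
    health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
    """
    totals = defaultdict(int)
    for line in cat_output.splitlines():
        fields = line.split()
        if len(fields) < 7 or not fields[6].isdigit():
            continue  # skip blank, truncated, or unexpected lines
        index_name, docs_count = fields[2], fields[6]
        family = MONTH_SUFFIX.sub("", index_name)
        totals[family] += int(docs_count)
    return dict(totals)


# Excerpt of the listing shown above.
sample = """\
yellow open lsf_jobs_pend-202212          9ViIS3nDRFewrqtILEbKTQ 5 1     707     0 446.9kb 446.9kb
yellow open lsf_jobs_pend-202209          eTgqPp9nQOScNiwyUWXmHA 5 1       9     0 106.2kb 106.2kb
yellow open lsf_hostconf_latest           9N1Y8ML4TiyaamCPEDRQog 5 1       2     0  10.6kb  10.6kb
"""

for family, docs in sorted(summarize_indices(sample).items()):
    print(family, docs)
```

In practice the text passed to `summarize_indices` would be the full output of the curl command rather than the excerpt.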

(2) With the LSF data stored in Elasticsearch, the next step is to connect to the Grafana server. Here we point our browser to the Grafana server on the default port, http://lsf_manager:3000, and log in to Grafana. This step assumes an account has already been set up on Grafana. Here we are using the default admin account.

(3) In Grafana, navigate to Configuration -> Data sources. This is where an Elasticsearch data source can be added.

(4) Next, click the Add data source button.

(5) In the list of data sources, filter by name for Elasticsearch and click the Select button on the Elasticsearch entry.

(6) When configuring the data source, it’s necessary to specify an index name. This is where the list of indices in Elasticsearch that we generated earlier will come in handy. For this example, we wish to display the total number of pending jobs in the Spectrum LSF cluster over time. This data is stored in the lsf_jobs_pend* indices in Elasticsearch. To configure the data source appropriately, we specify the following values:

Next, click on the Save & Test button. It should return the message Index OK. Time field name OK.

Assuming that no errors were found, click on the Back button.
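
The same data source can also be defined outside the UI. The following is a rough sketch of a request body for Grafana's `POST /api/datasources` HTTP endpoint, not the configuration used in this post. The helper name `build_es_datasource` is hypothetical, the Elasticsearch URL is taken from the curl command earlier, and the time field shown is only a placeholder, since the correct field depends on how the LSF indices are mapped.

```python
import json


def build_es_datasource(name, es_url, index_pattern, time_field):
    """Build a JSON body for Grafana's POST /api/datasources endpoint."""
    return {
        "name": name,
        "type": "elasticsearch",
        "access": "proxy",  # the Grafana server proxies requests to Elasticsearch
        "url": es_url,
        # Assumption: older Grafana releases read the index from "database",
        # newer ones from jsonData["index"], so both are set here.
        "database": index_pattern,
        "jsonData": {
            "index": index_pattern,
            "timeField": time_field,
        },
    }


payload = build_es_datasource(
    name="LSF pending jobs",
    es_url="http://localhost:9200",
    index_pattern="lsf_jobs_pend*",
    time_field="@timestamp",  # placeholder; replace with the time field of the LSF index
)
print(json.dumps(payload, indent=2))
```

The printed JSON could then be sent to the Grafana server with any HTTP client, authenticated as a user allowed to manage data sources.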

(7) Now you should see LSF pending jobs listed as a Data Source.

(8) With the data source configured, we’re now ready to configure a dashboard to display the LSF pending job information. Navigate to Create -> Dashboard.

(9) Click on Add an empty panel. This is used to create a new panel where the LSF pending job information will be plotted.

(10) In the panel editor, specify the following options:

You should immediately see in the panel editor the plot of the hourly pending jobs. Click on the Apply button to save the changes.
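
Behind a time-series panel like this, Grafana sends Elasticsearch a search with a date histogram aggregation. The following is a minimal sketch of such a request body with hourly buckets, not the exact query Grafana generates. The function name `hourly_count_query` and the time field are assumptions, and whether a plain document count or a sum over a field is the right metric depends on how LSF lays out the documents in the lsf_jobs_pend* indices.

```python
import json


def hourly_count_query(time_field, start, end, interval_key="fixed_interval"):
    """Build an Elasticsearch search body that counts documents per hour.

    Elasticsearch releases before 7.2 expect interval_key="interval".
    """
    return {
        "size": 0,  # only the aggregation buckets are needed, not the hits
        "query": {
            "range": {time_field: {"gte": start, "lte": end}}
        },
        "aggs": {
            "per_hour": {
                "date_histogram": {
                    "field": time_field,
                    interval_key: "1h",
                }
            }
        },
    }


# "@timestamp" is a placeholder for the time field of the LSF index.
body = hourly_count_query("@timestamp", "now-7d", "now")
print(json.dumps(body, indent=2))
```

Such a body could be posted to `lsf_jobs_pend*/_search` to check the buckets directly when a panel shows no data.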

(11) After clicking Apply, you will be returned to the Dashboard screen. The Dashboard should now display the new LSF pending jobs panel that was created above. This Dashboard could also include panels for system metrics, for example those collected by Prometheus.

(12) Next, click on the diskette icon in the upper right to save the Dashboard with the LSF pending jobs panel. We’ll name it Spectrum LSF cluster status.

Additional panels can be added to the Spectrum LSF cluster status dashboard based on the data logged by Spectrum LSF to Elasticsearch.

That concludes the simple example of plotting Spectrum LSF cluster data from Elasticsearch in Grafana. As mentioned, the IBM Spectrum LSF Suites integrated web interface also provides reporting capabilities, with several built-in reports provided out of the box. Below, we’ve included a screenshot of the pending job analysis report included with Spectrum LSF Suites.

Summary

Spectrum LSF provides many hooks and integration points that let administrators change things ranging from scheduling behavior and the output of query commands through to job information being logged to Elasticsearch. Spectrum LSF is highly customizable by organizations to suit specific needs and requirements. We’ve demonstrated this using Grafana to visualize data from the LSF scheduler in a simple example. Following the above example, administrators can combine existing HPC cluster system-level reporting in Grafana with job information from Spectrum LSF for a better overall view and understanding of the infrastructure.