hpc.social


High Performance Computing
Practitioners
and friends /#hpc
Share: 
This is a crosspost from   Blogs on Technical Computing Goulash Recent content in Blogs on Technical Computing Goulash. See the original post here.

Customizing command output in IBM Spectrum LSF

IBM Spectrum LSF provides many ways to query the LSF cluster for information about workloads. As a user, once you’ve submitted a job to LSF, it’s logical to want to understand what has happened to your job. Has the job started yet? Is the job pending? If so, why is it pending? And the all important, “Is my job done yet?”. Of course, LSF provides a very rich CLI which has been developed and refined - over the past three decades. It’s also possible to get JSON-formatted output from various LSF query commands. This is useful for users and administrators alike as JSON-formatted output is easy to parse, and scripting can be used to extract values from the JSON output.

This is not meant to be a definitive guide on how to query information in LSF, but rather provides some examples of the various ways that users can query job related information using the LSF CLI. This will include a look at the -json and -o options which have been introduced during the lifecycle of LSF v10.1.0 family. The -json option can be used to provide JSON-formatted output from various LSF query commands and the -o can be used to customize the fields in the output to only those desired.

We’ll start with a simple job submission. Here we submit a test workload as a non-root user in the LSF cluster.

$ bsub -o $HOME/output.%J -e $HOME/error.%J ./testjob.sh
Job <24520> is submitted to default queue <normal>.

With the unique jobID number 24520, we can now query LSF for information about the job:

$ bjobs 24520
JOBID   USER    STAT  QUEUE      FROM_HOST   EXEC_HOST   JOB_NAME   SUBMIT_TIME
24520   gsamu   RUN   normal     kilenc      kilenc      *estjob.sh May 10 21:09

Adding the -l option to bjobs provides the long output (more details).

$ bjobs -l 24520

Job <24520>, User <gsamu>, Project <default>, Status <RUN>, Queue <normal>, Com
                     mand <./testjob.sh>, Share group charged </gsamu>
Tue May 10 21:09:22: Submitted from host <kilenc>, CWD <$HOME>, Output File </h
                     ome/gsamu/output.24520>, Error File </home/gsamu/error.245
                     20>;
Tue May 10 21:09:23: Started 1 Task(s) on Host(s) <kilenc>, Allocated 1 Slot(s)
                      on Host(s) <kilenc>, Execution Home </home/gsamu>, Execut
                     ion CWD </home/gsamu>;
Tue May 10 21:10:01: Resource usage collected.
                     MEM: 12 Mbytes;  SWAP: 0 Mbytes;  NTHREAD: 5
                     PGID: 313588;  PIDs: 313588 313589 313591 313592 


 MEMORY USAGE:
 MAX MEM: 12 Mbytes;  AVG MEM: 10 Mbytes

 SCHEDULING PARAMETERS:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -  
 loadStop    -     -     -     -       -     -    -     -     -      -      -  

 RESOURCE REQUIREMENT DETAILS:
 Combined: select[type == local] order[r15s:pg]
 Effective: select[type == local] order[r15s:pg] 

It is possible to customize the output format of the bjobs command using the -o option. In this case, we want to show only some specific details about the job in the output of bjobs. We’ve selected to view: jobID, job status, project name, memory consumed, output and error files. A full list of the available fields for the custom format can be found here.

$ bjobs -o "jobid stat: queue:- project:10  mem:12:G output_file error_file" 24520
JOBID STAT       QUEUE PROJ_NAME  MEM          OUTPUT_FILE ERROR_FILE
24520 RUN       normal default    0.01 G       /home/gsamu/output.24520 /home/gsamu/error.24520

Adding the -json option, it’s possible to get this customized job output in JSON format.

$ bjobs -o "jobid stat: queue:- project:10  mem:12:G output_file error_file" -json 24520
{
  "COMMAND":"bjobs",
  "JOBS":1,
  "RECORDS":[
    {
      "JOBID":"24520",
      "STAT":"RUN",
      "QUEUE":"normal",
      "PROJ_NAME":"default",
      "MEM":"0.01 G",
      "OUTPUT_FILE":"\/home\/gsamu\/output.24520",
      "ERROR_FILE":"\/home\/gsamu\/error.24520"
    }
  ]
}

Next, let’s look at the bhist command. This can be used to view historical data about jobs.

$ bhist 24520
Summary of time in seconds spent in various states:
JOBID   USER    JOB_NAME  PEND    PSUSP   RUN     USUSP   SSUSP   UNKWN   TOTAL
24520   gsamu   *tjob.sh  1       0       457     0       0       0       458       

We see that the job command has been truncated. Let’s now run bhist again with the -w option to produce a wide output.

$ bhist -w 24520
Summary of time in seconds spent in various states:
JOBID   USER    JOB_NAME  PEND    PSUSP   RUN     USUSP   SSUSP   UNKWN   TOTAL
24520   gsamu   ./testjob.sh 1       0       462     0       0       0       463       

And finally, with the -l option to produce a long, detailed output.

$ bhist -l 24520

Job <24520>, User <gsamu>, Project <default>, Command <./testjob.sh>
Tue May 10 21:09:22: Submitted from host <kilenc>, to Queue <normal>, CWD <$HOM
                     E>, Output File </home/gsamu/output.%J>, Error File </home
                     /gsamu/error.%J>;
Tue May 10 21:09:23: Dispatched 1 Task(s) on Host(s) <kilenc>, Allocated 1 Slot
                     (s) on Host(s) <kilenc>, Effective RES_REQ <select[type ==
                      local] order[r15s:pg] >;
Tue May 10 21:09:25: Starting (Pid 313588);
Tue May 10 21:09:25: Running with execution home </home/gsamu>, Execution CWD <
                     /home/gsamu>, Execution Pid <313588>;


Summary of time in seconds spent in various states by  Tue May 10 21:17:26
  PEND     PSUSP    RUN      USUSP    SSUSP    UNKWN    TOTAL
  1        0        483      0        0        0        484         

When the job is done, the bacct command can be used to get detailed accounting information for jobs.

$ bacct 24520

Accounting information about jobs that are: 
  - submitted by all users.
  - accounted on all projects.
  - completed normally or exited
  - executed on all hosts.
  - submitted to all queues.
  - accounted on all service classes.
  - accounted to all RC accounts.
------------------------------------------------------------------------------

SUMMARY:      ( time unit: second ) 
 Total number of done jobs:       1      Total number of exited jobs:     0
 Total CPU time consumed:       3.4      Average CPU time consumed:     3.4
 Maximum CPU time of a job:     3.4      Minimum CPU time of a job:     3.4
 Total wait time in queues:     1.0
 Average wait time in queue:    1.0
 Maximum wait time in queue:    1.0      Minimum wait time in queue:    1.0
 Average turnaround time:       669 (seconds/job)
 Maximum turnaround time:       669      Minimum turnaround time:       669
 Average hog factor of a job:  0.01 ( cpu time / turnaround time )
 Maximum hog factor of a job:  0.01      Minimum hog factor of a job:  0.01
 Average expansion factor of a job:  1.00 ( turnaround time / run time )
 Maximum expansion factor of a job:  1.00
 Minimum expansion factor of a job:  1.00
 Total Run time consumed:       668      Average Run time consumed:     668
 Maximum Run time of a job:     668      Minimum Run time of a job:     668
 Scheduler Efficiency for 1 jobs
 Slot Utilization:          100.00%  Memory Utilization:            100.00% 

And now the long, detailed output from bacct using the -l parameter.

$ bacct -l 24520

Accounting information about jobs that are: 
  - submitted by all users.
  - accounted on all projects.
  - completed normally or exited
  - executed on all hosts.
  - submitted to all queues.
  - accounted on all service classes.
  - accounted to all RC accounts.
------------------------------------------------------------------------------

Job <24520>, User <gsamu>, Project <default>, Status <DONE>, Queue <normal>, Co
                     mmand <./testjob.sh>, Share group charged </gsamu>
Tue May 10 21:09:22: Submitted from host <kilenc>, CWD <$HOME>, Output File </h
                     ome/gsamu/output.%J>, Error File </home/gsamu/error.%J>;
Tue May 10 21:09:23: Dispatched 1 Task(s) on Host(s) <kilenc>, Allocated 1 Slot
                     (s) on Host(s) <kilenc>, Effective RES_REQ <select[type ==
                      local] order[r15s:pg] >;
Tue May 10 21:20:31: Completed <done>.

Accounting information about this job:
     Share group charged </gsamu>
     CPU_T     WAIT     TURNAROUND   STATUS     HOG_FACTOR    MEM    SWAP
      3.37        1            669     done         0.0050    12M      0M
------------------------------------------------------------------------------

SUMMARY:      ( time unit: second ) 
 Total number of done jobs:       1      Total number of exited jobs:     0
 Total CPU time consumed:       3.4      Average CPU time consumed:     3.4
 Maximum CPU time of a job:     3.4      Minimum CPU time of a job:     3.4
 Total wait time in queues:     1.0
 Average wait time in queue:    1.0
 Maximum wait time in queue:    1.0      Minimum wait time in queue:    1.0
 Average turnaround time:       669 (seconds/job)
 Maximum turnaround time:       669      Minimum turnaround time:       669
 Average hog factor of a job:  0.01 ( cpu time / turnaround time )
 Maximum hog factor of a job:  0.01      Minimum hog factor of a job:  0.01
 Average expansion factor of a job:  1.00 ( turnaround time / run time )
 Maximum expansion factor of a job:  1.00
 Minimum expansion factor of a job:  1.00
 Total Run time consumed:       668      Average Run time consumed:     668
 Maximum Run time of a job:     668      Minimum Run time of a job:     668
 Scheduler Efficiency for 1 jobs
 Slot Utilization:          100.00%  Memory Utilization:            100.00% 

From jobs to queues

We’ve looked briefly at querying LSF for job related information. Let’s now take a closer look at querying LSF for information regarding the queue configuration. Batch queues are where users submit jobs to. Queues can have a very wide array of attributes and settings. Below we see a listing of the default queues configured in LSF Suite for HPC. The bqueues command is used to query LSF for the queue configuration.

$ bqueues
QUEUE_NAME      PRIO STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN  SUSP 
admin            50  Open:Active       -    -    -    -     0     0     0     0
owners           43  Open:Active       -    -    -    -     0     0     0     0
priority         43  Open:Active       -    -    -    -     0     0     0     0
night            40  Open:Active       -    -    -    -     0     0     0     0
short            35  Open:Active       -    -    -    -     0     0     0     0
dataq            33  Open:Active       -    -    -    -     0     0     0     0
normal           30  Open:Active       -    -    -    -     0     0     0     0
interactive      30  Open:Active       -    -    -    -     0     0     0     0
idle             20  Open:Active       -    -    -    -     0     0     0     0

The -l option of bqueues can be used to get a more details view about the queues. Here, we look at the long output for the queue normal.

$ bqueues -l normal

QUEUE: normal
  -- For normal low priority jobs, running only if hosts are lightly loaded.  This is the default queue.

PARAMETERS/STATISTICS
PRIO NICE STATUS          MAX JL/U JL/P JL/H NJOBS  PEND   RUN SSUSP USUSP  RSV PJOBS 
 30    0  Open:Active       -    -    -    -     0     0     0     0     0    0     0
Interval for a host to accept two jobs is 0 seconds

SCHEDULING PARAMETERS
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -  
 loadStop    -     -     -     -       -     -    -     -     -      -      -  

SCHEDULING POLICIES:  FAIRSHARE  NO_INTERACTIVE
USER_SHARES:  [default, 1] 

SHARE_INFO_FOR: normal/
 USER/GROUP   SHARES  PRIORITY  STARTED  RESERVED  CPU_TIME  RUN_TIME   ADJUST  GPU_RUN_TIME
gsamu           1       0.333      0        0         0.0        0       0.000             0
elasticsearch     1       0.333      0        0         0.0        0       0.000             0

USERS: all  
HOSTS:  all 

Custom output formatting can also be used for the bqueues command. Below is an example of the use of custom output formatting using the -o parameter. For this example, we display queue name, status and the number of jobs (all states). More details about the bqueues -o parameter can be found here.

$ bqueues -o "queue_name:12 status:12 njobs"
QUEUE_NAME   STATUS       NJOBS
admin        Open:Active  0
owners       Open:Active  0
priority     Open:Active  0
night        Open:Active  0
short        Open:Active  0
dataq        Open:Active  0
normal       Open:Active  0
interactive  Open:Active  0
idle         Open:Active  0

And for JSON-formatted output, we add the -json parameter.

$ bqueues -json -o "queue_name:12 status:12 njobs"
{
  "COMMAND":"bqueues",
  "QUEUES":9,
  "RECORDS":[
    {
      "QUEUE_NAME":"admin",
      "STATUS":"Open:Active",
      "NJOBS":"0"
    },
    {
      "QUEUE_NAME":"owners",
      "STATUS":"Open:Active",
      "NJOBS":"0"
    },
    {
      "QUEUE_NAME":"priority",
      "STATUS":"Open:Active",
      "NJOBS":"0"
    },
    {
      "QUEUE_NAME":"night",
      "STATUS":"Open:Active",
      "NJOBS":"0"
    },
    {
      "QUEUE_NAME":"short",
      "STATUS":"Open:Active",
      "NJOBS":"0"
    },
    {
      "QUEUE_NAME":"dataq",
      "STATUS":"Open:Active",
      "NJOBS":"0"
    },
    {
      "QUEUE_NAME":"normal",
      "STATUS":"Open:Active",
      "NJOBS":"0"
    },
    {
      "QUEUE_NAME":"interactive",
      "STATUS":"Open:Active",
      "NJOBS":"0"
    },
    {
      "QUEUE_NAME":"idle",
      "STATUS":"Open:Active",
      "NJOBS":"0"
    }
  ]
}

From queues to servers

Finally, we’ll look at the LSF bhosts command, which is used to display information about the batch hosts in the LSF cluster.

$ bhosts
HOST_NAME          STATUS       JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV 
archie             ok              -      2      0      0      0      0      0
kilenc             ok              -     32      0      0      0      0      0

To view detailed information about a batch host, the -l parameter can be specified for bhosts. Here we query for information on host archie.

$ bhosts -l archie
HOST  archie
STATUS           CPUF  JL/U    MAX  NJOBS    RUN  SSUSP  USUSP    RSV DISPATCH_WINDOW
ok               6.00     -      2      0      0      0      0      0      -

 CURRENT LOAD USED FOR SCHEDULING:
                r15s   r1m  r15m    ut    pg    io   ls    it   tmp   swp   mem  slots  ngpus
 Total           0.0   0.0   0.0    0%   0.0     1    1   437 3456M    0M  1.7G      2    0.0
 Reserved        0.0   0.0   0.0    0%   0.0     0    0     0    0M    0M    0M      -     - 

               ngpus_physical gpu_shared_avg_ut gpu_shared_avg_mut gpu_mode0
 Total                    0.0               0.0                0.0       0.0
 Reserved                  -                 -                  -         - 

               gpu_mode1 gpu_mode2 gpu_mode3 gpu_mode4 gpu_mode5 gpu_mode6
 Total               0.0       0.0       0.0       0.0       0.0       0.0
 Reserved             -         -         -         -         -         - 

               gpu_mode7 gpu_temp0 gpu_temp1 gpu_temp2 gpu_temp3 gpu_temp4
 Total               0.0       0.0       0.0       0.0       0.0       0.0
 Reserved             -         -         -         -         -         - 

               gpu_temp5 gpu_temp6 gpu_temp7 gpu_ecc0 gpu_ecc1 gpu_ecc2 gpu_ecc3
 Total               0.0       0.0       0.0      0.0      0.0      0.0      0.0
 Reserved             -         -         -        -        -        -        - 

               gpu_ecc4 gpu_ecc5 gpu_ecc6 gpu_ecc7 gpu_ut0 gpu_ut1 gpu_ut2 gpu_ut3
 Total              0.0      0.0      0.0      0.0     0.0     0.0     0.0     0.0
 Reserved            -        -        -        -       -       -       -       - 

               gpu_ut4 gpu_ut5 gpu_ut6 gpu_ut7 gpu_mut0 gpu_mut1 gpu_mut2 gpu_mut3
 Total             0.0     0.0     0.0     0.0      0.0      0.0      0.0      0.0
 Reserved           -       -       -       -        -        -        -        - 

               gpu_mut4 gpu_mut5 gpu_mut6 gpu_mut7 gpu_mtotal0 gpu_mtotal1
 Total              0.0      0.0      0.0      0.0         0.0         0.0
 Reserved            -        -        -        -           -           - 

               gpu_mtotal2 gpu_mtotal3 gpu_mtotal4 gpu_mtotal5 gpu_mtotal6
 Total                 0.0         0.0         0.0         0.0         0.0
 Reserved               -           -           -           -           - 

               gpu_mtotal7 gpu_mused0 gpu_mused1 gpu_mused2 gpu_mused3 gpu_mused4
 Total                 0.0        0.0        0.0        0.0        0.0        0.0
 Reserved               -          -          -          -          -          - 

               gpu_mused5 gpu_mused6 gpu_mused7 gpu_maxfactor
 Total                0.0        0.0        0.0           0.0
 Reserved              -          -          -             - 


 LOAD THRESHOLD USED FOR SCHEDULING:
           r15s   r1m  r15m   ut      pg    io   ls    it    tmp    swp    mem
 loadSched   -     -     -     -       -     -    -     -     -      -      -  
 loadStop    -     -     -     -       -     -    -     -     -      -      -  

 
 CONFIGURED AFFINITY CPU LIST: all

Similar to bjobs and bqueues, the -o parameter can be used for custom formatting of output of bhosts. Below is an example of the use of custom output formatting using the -o parameter. For this example, we display host name, status and the number of jobs (all states). More details about the bqueues -o parameter can be found here.

$ bhosts -o "host_name:12 status:12 njobs" 
HOST_NAME    STATUS       NJOBS
archie       ok           0
kilenc       ok           0

And adding the -json parameter for JSON-formatted output.

$ bhosts -json -o "host_name:12 status:12 njobs" 
{
  "COMMAND":"bhosts",
  "HOSTS":2,
  "RECORDS":[
    {
      "HOST_NAME":"archie",
      "STATUS":"ok",
      "NJOBS":"0"
    },
    {
      "HOST_NAME":"kilenc",
      "STATUS":"ok",
      "NJOBS":"0"
    }
  ]
}

That concludes our brief look at LSF query commands. We’ve only scratched the surface here in terms of capabilities and query commands for LSF. The LSF command line interface is powerful and flexible including ways to customize the command outputs and to output in JSON-format. For more details, the complete set of IBM Spectrum LSF documentation can be found online at IBM Documentation here.