Intel® Cluster Checker can be integrated with a job scheduler or resource manager, but node configuration requires special handling. This article shows an example that works with OpenPBS or TORQUE.
Intel Cluster Checker stores node configuration as comments within its node list, while the node list produced by a job scheduler lacks this configuration information. To work around this, the two files must be combined into a single runtime node list, which requires additional steps in the job script before executing clck.
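For illustration, the two files might look like the following (hypothetical hostnames; the comment syntax matches the node list entries shown later in this article). The Intel Cluster Checker node list carries configuration in comments, while the scheduler's node file lists bare hostnames, once per process:

### Hypothetical Intel Cluster Checker node list (nodes.list)
node01 #type : compute
node02 #type : compute

### Hypothetical scheduler node file ($PBS_NODEFILE) for a job on node02
node02
node02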
You must confirm that the hostnames in the Intel Cluster Checker node list and the scheduler's node list are identical. If using TORQUE, names in the Intel Cluster Checker node list must exactly match those output by the pbsnodes command.
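As a quick check, the two sets of hostnames can be compared. The following is a sketch that assumes pbsnodes -a prints each node name at the start of an unindented line, as TORQUE does, and that each nodes.list entry begins with the hostname:

### Hypothetical comparison of scheduler and Intel Cluster Checker hostnames
pbsnodes -a | awk 'NF && !/^[[:space:]]/ {print $1}' | sort > scheduler_hostnames
awk '!/^#/ && NF {print $1}' nodes.list | sort > clck_hostnames
diff scheduler_hostnames clck_hostnames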
When submitting the job, lock out each node entirely, regardless of the number of processors or cores available. Intel Cluster Checker performance tests can degrade the performance of other applications running on the same nodes. If exclusive access is not possible, limit Intel Cluster Checker to a level 1 or level 2 health check, which should not disturb other jobs sharing the nodes.
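For example, restricting the run to a level 2 check only changes the level argument on the clck command line (the node list, configuration file, and other options shown here are the ones used in the full example below):

clck -t -L 2 -D -F clck_joblist -c $HOME/config.xml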
The job script will need to do the following:
- Eliminate any duplicate lines in the job node list. TORQUE writes each hostname once per allocated process. Write the de-duplicated list to a new temporary file instead of altering the scheduler's file directly.
- Filter the Intel Cluster Checker node list so that it contains only the nodes in the job node list. This runtime node list can be created with grep.
- Set the Intel Cluster Checker log directory. This saves debug files generated during execution.
- Run clck using the temporary node list.
Example
Here is an example TORQUE 4.2.8 script for a level 4 health check. In this example, the Intel Cluster Checker node list is named “nodes.list” and the XML configuration file is named “config.xml”. Change these to match your file names.
#PBS -V
#PBS -N clckjob
#PBS -q queue_name
#PBS -e localhost:$HOME/clckjob.err
#PBS -o localhost:$HOME/clckjob.log
#PBS -l nodes=5
### Lock nodes to a single job
#PBS -l naccesspolicy=SINGLEJOB
#PBS -l walltime=00:05:00
### clck requires a Bourne-compatible shell
#PBS -S /bin/sh

### Clean up old runs
rm -f tmp_nodelist clck_joblist

### Create log directory
mkdir -p $HOME/CLCK_LOGS

### Set log environment
export CLCK_LOG_DIR=$HOME/CLCK_LOGS

### Sort scheduler node list and remove duplicate lines
sort $PBS_NODEFILE | uniq > tmp_nodelist

### Create updated node list; if a node is listed in tmp_nodelist
### then copy the entry from the configuration node list to the runtime node list
grep -Fwf tmp_nodelist nodes.list > clck_joblist

### Run Intel Cluster Checker
clck -t -L 4 -D -F clck_joblist -c $HOME/config.xml

echo "End of job"
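If the script above is saved as, say, clck_healthcheck.sh (a file name chosen here for illustration), it is submitted like any other batch job:

qsub clck_healthcheck.sh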
Note: There is an unlikely but possible scenario in which a line in the Intel Cluster Checker node list contains one valid node name at the beginning and another at the end. This can cause a problem with the example script. For example:
node01 #type : compute #Same hardware as node02
In this situation, grep copies the entry for node01 into the runtime node list whenever node02 is part of the job, even if node01 is not. Either the grep command must use a stricter match, or the line in the node list must be reworded so that the comment no longer contains another node name. This change resolves the problem:
node01 #type : compute #same hardware as the next compute node
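Alternatively, instead of rewording the comment, the grep line in the job script can be replaced with a stricter match that compares only the hostname at the start of each entry. A minimal sketch using awk (any POSIX awk should work):

### Copy an entry only when its first field is a node allocated to the job
awk 'NR==FNR {nodes[$1]; next} $1 in nodes' tmp_nodelist nodes.list > clck_joblist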