
Executing Intel® Cluster Checker 2.x through a job scheduler


Intel® Cluster Checker can be integrated with a job scheduler or resource manager, but the node list requires special handling. This article shows an example that works with OpenPBS or TORQUE.

Intel Cluster Checker stores node configuration as comments within its node list, but the node list produced by a job scheduler lacks this configuration information. To work around this, the job script must combine the two files into a single runtime node list, which requires a few extra steps before executing clck.

Confirm that the hostnames in the Intel Cluster Checker node list and the scheduler's node list are identical. With TORQUE, names in the Intel Cluster Checker node list must exactly match the names output by the pbsnodes command.
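
A quick way to verify this is to compare the two name sets directly. The following is a minimal sketch; it assumes the Intel Cluster Checker node list puts the hostname first on each non-comment line, and that pbsnodes -a prints each node name unindented at the start of its record (the exact output format can vary by TORQUE version):

### Hostnames from the clck node list (first word of each non-comment line)
awk '!/^#/ && NF {print $1}' nodes.list | sort > clck_names
### Node names from pbsnodes output (unindented lines)
pbsnodes -a | awk '/^[^ \t]/ {print $1}' | sort > pbs_names
### No output means the two name sets match
diff clck_names pbs_names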

When submitting the job, lock out the entire node, regardless of how many processors or cores are available (the example below does this with naccesspolicy=SINGLEJOB). Intel Cluster Checker performance tests can degrade the performance of other applications running on the same node. If exclusive access is not possible, limit Intel Cluster Checker to a level 1 or level 2 health check, which should not disturb other jobs on the same node.
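
For example, a lower-impact shared-node run might look like the following sketch, using the same file names as the example script below:

clck -L 1 -F clck_joblist -c $HOME/config.xml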

The job script will need to do the following:

  • Eliminate duplicate lines from the job node list. TORQUE outputs each hostname once per process. Write the result to a new temporary file instead of altering the list directly.
  • Filter the Intel Cluster Checker node list down to only the nodes in the job node list. This can be done with grep.
  • Set the Intel Cluster Checker log directory so that the debug files generated during execution are saved.
  • Run clck using the temporary node list.

Example

Here is an example TORQUE 4.2.8 script for a level 4 health check. In this example, the Intel Cluster Checker node list is named “nodes.list” and the XML configuration file is named “config.xml”. Change these to match your file names.

#PBS -V
#PBS -N clckjob
#PBS -q queue_name
#PBS -e localhost:$HOME/clckjob.err
#PBS -o localhost:$HOME/clckjob.log
#PBS -l nodes=5
### Lock nodes to a single job
#PBS -l naccesspolicy=SINGLEJOB
#PBS -l walltime=00:05:00
### clck requires a Bourne-compatible shell
#PBS -S /bin/sh

### Clean up old runs
rm -f tmp_nodelist clck_joblist
### Create log directory
mkdir -p $HOME/CLCK_LOGS
### Set log environment
export CLCK_LOG_DIR=$HOME/CLCK_LOGS
### Sort scheduler node list and remove duplicate lines
sort $PBS_NODEFILE | uniq > tmp_nodelist
### Create updated node list; if a node is listed in tmp_nodelist,
### then copy its entry from the configuration node list to the runtime node list
grep -Fwf tmp_nodelist nodes.list > clck_joblist
### Run Intel Cluster Checker
clck -t -L 4 -D -F clck_joblist -c $HOME/config.xml
echo "End of job"

Note: There is an unlikely but possible scenario that can cause a problem with the example script: a line in the Intel Cluster Checker node list begins with one valid node name and ends with another. For example:

node01 #type : compute #Same hardware as node02

In this situation, grep -Fw matches node02 as a whole word anywhere in the line, so node01's entry would be copied into the runtime node list even when only node02 is part of the job. Either the grep command will need a more thorough regular expression, or the line in the node list will need to be changed. Moving the comment onto its own line resolves the problem:

node01 #type : compute
#node01 uses the same hardware as node02
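
Alternatively, the grep patterns can be anchored so that a node name only matches at the beginning of a line. The following is a minimal sketch assuming GNU grep; it builds an extended-regex pattern file (tmp_patterns, a hypothetical name) from tmp_nodelist:

### Anchor each node name to the start of a line, followed by whitespace or end of line
sed 's/.*/^&([[:space:]]|$)/' tmp_nodelist > tmp_patterns
grep -Ef tmp_patterns nodes.list > clck_joblist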
