The Intel® Cluster Checker tool evaluates HPC clusters for consistency, functionality, and performance. This includes the capability to evaluate the hardware configuration of Intel® Xeon Phi™ coprocessors.
This article describes how to use Intel® Cluster Checker to:
- Confirm that coprocessor hardware and firmware are consistent across a cluster or user-defined subclusters.
- Validate that the Intel® MPSS (Many-Core Platform Software Stack) version exactly matches a specified value.
This article assumes that the software stack on each node is identical, except for those files that define node identity. Intel® MPSS is installed on all nodes, although it will not be active for nodes without coprocessors. Intel® Cluster Checker will report an error if software differences are detected.
Implementing the basic check
Intel® Xeon Phi™ coprocessor uniformity is evaluated using the micinfo and miccheck utilities, which are included with the Intel® MPSS installation. Intel® Cluster Checker executes these utilities in parallel on each node, then analyzes the collected results.
The basic check confirms that micinfo returns identical results for each node and that miccheck returns no errors.
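The idea behind the uniformity check can be illustrated with a minimal Python sketch. This is not how Intel® Cluster Checker is implemented; the function name `find_outliers` and the placeholder output strings are assumptions made purely for illustration. The sketch treats the most common output as the reference and reports every node that deviates from it.

```python
from collections import Counter


def find_outliers(outputs):
    """Return the nodes whose collected output differs from the most common output."""
    if not outputs:
        return []
    # The most frequently seen output is taken as the reference.
    reference, _ = Counter(outputs.values()).most_common(1)[0]
    return sorted(node for node, text in outputs.items() if text != reference)


# Hypothetical placeholders standing in for the micinfo output of each node.
collected = {
    "node01": "coprocessor profile A",
    "node02": "coprocessor profile A",
    "node03": "coprocessor profile B",
}

print(find_outliers(collected))
```

With these placeholder values, only node03 is reported, because its output differs from the output shared by the other two nodes.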
The <micconf> test module will be automatically executed during any level 4 or 5 health check. To execute the <micconf> test module in other situations, add the following child element to the <cluster> element in the configuration file.
<include_module>micconf</include_module>
Or, use the --include micconf option on the clck command line.
For a homogeneous cluster, the default configuration file usually needs no changes.
Validating the MPSS version
The MPSS version can be specified using the <mpss-version> tag. For example, add the following to the configuration file.
<test>
  <micconf>
    <mpss-version>3.2</mpss-version>
  </micconf>
  . . .
</test>
Intel® Cluster Checker will confirm that all host systems have this version of Intel® MPSS installed.
Heterogeneous clusters
It is common for cluster nodes to have different coprocessor configurations. For example, only a subset of nodes may contain coprocessors, or nodes may contain different numbers of coprocessors. When only one or two nodes in a cluster contain coprocessors, they are referred to as “enhanced capability” nodes or “fat” nodes. Larger numbers of enhanced capability nodes constitute a subcluster.
Intel® Cluster Checker uses node groups to check heterogeneous clusters correctly. Define enhanced capability nodes and subclusters as separate node groups.
To identify host nodes containing coprocessors, add group information to the node list. In the following example of a 4+1 node cluster, three nodes host a single Intel® Xeon Phi™ coprocessor. One node hosts two coprocessors and the head node hosts none. Two groups are used to identify these differences.
master #type: head
node01 #type: compute group: 1mic
node01-mic0 #arch: k1om
node02 #type: compute group: 1mic
node02-mic0 #arch: k1om
node03 #type: compute group: 1mic
node03-mic0 #arch: k1om
node04 #type: compute group: 2mic
node04-mic0 #arch: k1om
node04-mic1 #arch: k1om
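To make the group assignments in this node list easier to see, the following Python sketch parses the same entries and collects host names by group. The parser is an assumption based only on the example above (a host name followed by `#` and `key: value` annotations); it is not the parser used by Intel® Cluster Checker, and the function names are hypothetical.

```python
import re
from collections import defaultdict

# The node list from the example above, one entry per line.
NODE_LIST = """\
master #type: head
node01 #type: compute group: 1mic
node01-mic0 #arch: k1om
node02 #type: compute group: 1mic
node02-mic0 #arch: k1om
node03 #type: compute group: 1mic
node03-mic0 #arch: k1om
node04 #type: compute group: 2mic
node04-mic0 #arch: k1om
node04-mic1 #arch: k1om
"""


def parse_node_list(text):
    """Map each host name to its 'key: value' annotations (assumed format)."""
    nodes = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        name, _, annotations = line.partition("#")
        nodes[name.strip()] = dict(re.findall(r"(\w+):\s*(\S+)", annotations))
    return nodes


def hosts_by_group(nodes):
    """Collect the names of all entries that carry a 'group' annotation."""
    groups = defaultdict(list)
    for name, attrs in nodes.items():
        if "group" in attrs:
            groups[attrs["group"]].append(name)
    return dict(groups)


print(hosts_by_group(parse_node_list(NODE_LIST)))
```

For this node list, the sketch places node01, node02, and node03 in the 1mic group and node04 in the 2mic group. The head node and the coprocessor entries carry no group annotation.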
The defined groups are then used to modify the XML configuration file. Modules that need to be modified for different quantities or types of coprocessors are:
- <hardware>
- <kernel>
- <micconf>
Use the <group> element to identify the areas of the configuration file that should be evaluated independently.
<cluster>
  <include_module>micconf</include_module>
  <test>
    <hardware>
      <group name="1mic"/>
      <group name="2mic"/>
    </hardware>
    <kernel>
      <group name="1mic"/>
      <group name="2mic"/>
    </kernel>
    <micconf>
      <group name="1mic"/>
      <group name="2mic"/>
    </micconf>
    . . .
  </test>
</cluster>
Use of the “1mic” group is optional in this case, since all nodes not within a group will be evaluated as part of the default group.
Groups are tested independently. If uniform software versions are required throughout a cluster, differences between groups can go undetected, especially if a group contains only one or two nodes. To avoid this, specify the expected version in the configuration file.
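The pitfall can be shown with a short Python sketch. It is an illustration only, not the tool's actual logic: the function names are hypothetical, and the version "3.1" on node04 is an invented value chosen to differ from the "3.2" used earlier in this article.

```python
def uniform_within_groups(versions, groups):
    """True if every group, taken on its own, reports a single version."""
    return all(
        len({versions[node] for node in members}) == 1
        for members in groups.values()
    )


def mismatches(versions, expected):
    """Return the nodes whose version differs from the expected value."""
    return sorted(node for node, version in versions.items() if version != expected)


groups = {"1mic": ["node01", "node02", "node03"], "2mic": ["node04"]}

# Hypothetical installed versions; node04 is the only member of its group.
versions = {"node01": "3.2", "node02": "3.2", "node03": "3.2", "node04": "3.1"}

print(uniform_within_groups(versions, groups))
print(mismatches(versions, "3.2"))
```

Each group is internally uniform, so a per-group comparison alone finds nothing wrong. Comparing every node against an explicit expected version exposes node04.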