Introduction
The purpose of this document is to help developers speed up the execution of the programs that use popular deep learning frameworks in the background. There are situations where we have observed that the deep learning code, with default settings, does not take advantage of the full compute capability of the underlying machine on which it runs. This is often the case, especially when the code runs on Intel® Xeon® processors.
Optimization
The primary goal of the performance optimization tips given in this section is to make use of all the cores available in the machine. Intel® DevCloud consists of Intel® Xeon® Gold 6128 processors.
Assume that the number of cores per socket in the machine is denoted as NUM_PARALLEL_EXEC_UNITS. On the Intel DevCloud, assign NUM_PARALLEL_EXEC_UNITS to 6.
TensorFlow
To get the best performance from a machine, change the parallelism threads and OpenMP* settings as below:
import tensorflow as tf
config = tf.ConfigProto(intra_op_parallelism_threads=NUM_PARALLEL_EXEC_UNITS, inter_op_parallelism_threads=2, allow_soft_placement=True, device_count = {'CPU': NUM_PARALLEL_EXEC_UNITS})
session = tf.Session(config=config)
os.environ["OMP_NUM_THREADS"] = "NUM_PARALLEL_EXEC_UNITS"
os.environ["KMP_BLOCKTIME"] = "30"
os.environ["KMP_SETTINGS"] = "1"
os.environ["KMP_AFFINITY"]= "granularity=fine,verbose,compact,1,0"
Keras with TensorFlow Backend
To get the best performance from a machine, change the parallelism threads and OpenMP settings as below:
from keras import backend as K
import tensorflow as tf
config = tf.ConfigProto(intra_op_parallelism_threads=NUM_PARALLEL_EXEC_UNITS, inter_op_parallelism_threads=2, allow_soft_placement=True, device_count = {'CPU': NUM_PARALLEL_EXEC_UNITS })
session = tf.Session(config=config)
K.set_session(session)
os.environ["OMP_NUM_THREADS"] = "NUM_PARALLEL_EXEC_UNITS"
os.environ["KMP_BLOCKTIME"] = "30"
os.environ["KMP_SETTINGS"] = "1"
os.environ["KMP_AFFINITY"]= "granularity=fine,verbose,compact,1,0"
Caffe
To get the best performance from the underlying machine, change the OpenMP settings as below:
export OMP_NUM_THREADS= NUM_PARALLEL_EXEC_UNITS
export KMP_AFFINITY= granularity=fine,verbose,compact,1,0
In general:
export OMP_NUM_THREADS= <number of threads to use>
export KMP_AFFINITY= <your affinity settings of choice>
For example:
KMP_AFFINITY=granularity=fine,balanced
KMP_AFFINITY=granularity=fine,compact
Conclusion
Even though we have observed a speed up in most cases, please note that the performance is largely code-dependent and there can be multiple other reasons that affect the code performance. A good code profiling tool like the Intel® VTune™ Amplifier can help you dig deeper and analyze performance problems.
Author
Anju Paul is a Technical Solutions Engineer working on behalf of the Intel AI® Academia Program