Channel: Intel Developer Zone Articles

Tips to Improve Performance for Popular Deep Learning Frameworks on CPUs

Introduction

This document helps developers speed up programs that run on popular deep learning frameworks. With default settings, deep learning code often does not use the full compute capability of the machine it runs on. We have observed this especially when the code runs on Intel® Xeon® processors.

Optimization

The performance tips in this section aim to use all the cores available in the machine. The Intel® DevCloud runs on Intel® Xeon® Gold 6128 processors.

Let NUM_PARALLEL_EXEC_UNITS denote the number of cores per socket in the machine. On the Intel DevCloud, set NUM_PARALLEL_EXEC_UNITS to 6.
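On other Linux machines, one way to find this value instead of hard-coding it is to read it from `lscpu`. The sketch below is an illustration and not part of the original tips: the helper names `cores_per_socket` and `detect_cores_per_socket` are hypothetical, the sample text is made up, and the code assumes English-locale `lscpu` output with a `Core(s) per socket:` line.

```python
import re
import subprocess


def cores_per_socket(lscpu_text):
    """Return the 'Core(s) per socket' value from lscpu output, or None."""
    match = re.search(
        r"^Core\(s\) per socket:\s*(\d+)\s*$", lscpu_text, re.MULTILINE
    )
    return int(match.group(1)) if match else None


def detect_cores_per_socket(default=1):
    """Run lscpu (Linux only); fall back to `default` if it is unavailable."""
    try:
        text = subprocess.run(
            ["lscpu"], capture_output=True, text=True, check=True
        ).stdout
    except (OSError, subprocess.CalledProcessError):
        return default
    value = cores_per_socket(text)
    return value if value is not None else default


# Made-up lscpu excerpt, shaped like a machine with 6 cores per socket.
SAMPLE = """Architecture:        x86_64
CPU(s):              24
Thread(s) per core:  2
Core(s) per socket:  6
Socket(s):           2
"""

NUM_PARALLEL_EXEC_UNITS = cores_per_socket(SAMPLE)
print(NUM_PARALLEL_EXEC_UNITS)
```

In a real script, `detect_cores_per_socket()` would replace the sample text. Note that `os.cpu_count()` reports logical CPUs across all sockets, so it is not a direct substitute.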

TensorFlow

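To get the best performance from the machine, set the TensorFlow thread-parallelism options and the OpenMP* environment variables as follows. The calls shown (tf.ConfigProto, tf.Session) belong to the TensorFlow 1.x API.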

import os

import tensorflow as tf

NUM_PARALLEL_EXEC_UNITS = 6  # cores per socket; 6 on the Intel DevCloud

config = tf.ConfigProto(intra_op_parallelism_threads=NUM_PARALLEL_EXEC_UNITS, inter_op_parallelism_threads=2, allow_soft_placement=True, device_count={'CPU': NUM_PARALLEL_EXEC_UNITS})

session = tf.Session(config=config)

# Environment values must be strings, so convert the core count with str().
os.environ["OMP_NUM_THREADS"] = str(NUM_PARALLEL_EXEC_UNITS)

os.environ["KMP_BLOCKTIME"] = "30"

os.environ["KMP_SETTINGS"] = "1"

os.environ["KMP_AFFINITY"] = "granularity=fine,verbose,compact,1,0"
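Because every value written to `os.environ` must be a string, it can help to build the four OpenMP variables in one place. The following is a minimal sketch under our own assumptions: `openmp_settings` and `apply_settings` are hypothetical helper names, and the defaults simply mirror the values used in this article (`KMP_BLOCKTIME` is given in milliseconds).

```python
import os


def openmp_settings(num_threads, blocktime_ms=30,
                    affinity="granularity=fine,verbose,compact,1,0"):
    """Build the OpenMP environment variables as a dict of strings."""
    if num_threads < 1:
        raise ValueError("num_threads must be at least 1")
    return {
        "OMP_NUM_THREADS": str(num_threads),
        "KMP_BLOCKTIME": str(blocktime_ms),
        "KMP_SETTINGS": "1",
        "KMP_AFFINITY": affinity,
    }


def apply_settings(settings, environ=os.environ):
    """Copy the settings into the given environment mapping."""
    for name, value in settings.items():
        environ[name] = value


NUM_PARALLEL_EXEC_UNITS = 6  # value this article uses for the Intel DevCloud
apply_settings(openmp_settings(NUM_PARALLEL_EXEC_UNITS))
```

Passing the integer through `str()` avoids the easy mistake of assigning the variable's name as a literal string. Calling the helper early in the script, before the framework starts its worker threads, is likely the safer order, although we have not verified this for every framework build.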

Keras with TensorFlow Backend

To get the best performance from the machine, set the TensorFlow thread-parallelism options and the OpenMP environment variables as follows, then hand the configured session to Keras:

import os

from keras import backend as K

import tensorflow as tf

NUM_PARALLEL_EXEC_UNITS = 6  # cores per socket; 6 on the Intel DevCloud

config = tf.ConfigProto(intra_op_parallelism_threads=NUM_PARALLEL_EXEC_UNITS, inter_op_parallelism_threads=2, allow_soft_placement=True, device_count={'CPU': NUM_PARALLEL_EXEC_UNITS})

session = tf.Session(config=config)

# Make Keras use the session configured above.
K.set_session(session)

# Environment values must be strings, so convert the core count with str().
os.environ["OMP_NUM_THREADS"] = str(NUM_PARALLEL_EXEC_UNITS)

os.environ["KMP_BLOCKTIME"] = "30"

os.environ["KMP_SETTINGS"] = "1"

os.environ["KMP_AFFINITY"] = "granularity=fine,verbose,compact,1,0"

Caffe

To get the best performance from the underlying machine, set the OpenMP environment variables in the shell before launching Caffe. There must be no space after the = sign:

NUM_PARALLEL_EXEC_UNITS=6

export OMP_NUM_THREADS=$NUM_PARALLEL_EXEC_UNITS

export KMP_AFFINITY=granularity=fine,verbose,compact,1,0

In general:

export OMP_NUM_THREADS=<number of threads to use>

export KMP_AFFINITY=<your affinity settings of choice>

For example:

KMP_AFFINITY=granularity=fine,balanced

KMP_AFFINITY=granularity=fine,compact
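To make the structure of these affinity strings easier to read, the sketch below splits a value into its parts. It follows our understanding of the Intel OpenMP runtime syntax, `[<modifier>,...]<type>[,<permute>][,<offset>]`. The function `parse_kmp_affinity` is a hypothetical helper for illustration only, and it deliberately ignores `proclist=[...]` modifiers, whose values contain commas.

```python
# Affinity types accepted in this simplified sketch.
AFFINITY_TYPES = {
    "balanced", "compact", "scatter", "explicit",
    "disabled", "none", "logical", "physical",
}


def parse_kmp_affinity(value):
    """Split a KMP_AFFINITY string into modifiers, type, permute and offset.

    Simplified: proclist=[...] modifiers are not supported.
    """
    modifiers, affinity_type, numbers = [], None, []
    for token in value.split(","):
        token = token.strip()
        if not token:
            continue
        if affinity_type is None:
            # Everything before the type keyword is a modifier.
            if token in AFFINITY_TYPES:
                affinity_type = token
            else:
                modifiers.append(token)
        else:
            # After the type come the optional permute and offset integers.
            numbers.append(int(token))
    if len(numbers) > 2:
        raise ValueError("at most permute and offset may follow the type")
    return {
        "modifiers": modifiers,
        "type": affinity_type,
        "permute": numbers[0] if len(numbers) > 0 else None,
        "offset": numbers[1] if len(numbers) > 1 else None,
    }


parsed = parse_kmp_affinity("granularity=fine,verbose,compact,1,0")
print(parsed)
```

For the string used earlier in this article, the modifiers are `granularity=fine` and `verbose`, the type is `compact`, the permute value is 1, and the offset is 0.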

Conclusion

Although we have observed a speedup in most cases, performance depends heavily on the code itself, and many other factors can affect it. A good profiling tool such as Intel® VTune™ Amplifier can help you dig deeper and analyze performance problems.
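Before reaching for a profiler, a simple wall-clock comparison can show whether a change in settings helps a particular program. The sketch below is only an illustration: `time_call` is a hypothetical helper, and `workload` is a stand-in for a real training or inference step.

```python
import time


def time_call(fn, repeats=5):
    """Return the best wall-clock time in seconds over `repeats` calls to fn()."""
    if repeats < 1:
        raise ValueError("repeats must be at least 1")
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


def workload():
    # Stand-in for a training or inference step; replace with real code.
    return sum(i * i for i in range(100_000))


elapsed = time_call(workload)
print(f"best of 5 runs: {elapsed:.6f} s")
```

Running the same measurement once with the default settings and once with the tuned settings, in separate processes, gives a rough before-and-after comparison.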

Author

Anju Paul is a Technical Solutions Engineer working on behalf of the Intel AI® Academia Program.

 

 

