Authors: Ping Tak Peter Tang, Jongsoo Park, Daehyun Kim, Vladimir Petrov - Intel Corporation
This was selected as a best paper finalist during the Supercomputing 2012 conference and discusses algorithmic modifications to reduce internode data exchange needs, to greatly enhance performance of 1-D FFT (Fast Fourier Transform) algorithms.
Abstract
In high-performance computing on distributed memory systems, communication often represents a significant part of the overall execution time. The relative cost of communication will certainly continue to rise as compute-density growth follows the current technology and industry trends. Design of lower-communication alternatives to fundamental computational algorithms has become an important field of research. For distributed 1-D FFT, communication cost has hitherto remained high as all industry-standard implementations perform three allto- all internode data exchanges (also called global transposes).
These communication steps indeed dominate execution time. In this paper, we present a mathematical framework from which many single-all-to-all and easy-to-implement 1-D FFT algorithms can be derived. For large-scale problems, our implementation can be twice as fast as leading FFT libraries on state-of-the-art computer clusters. Moreover, our framework allows tradeoff between accuracy and performance, further boosting performance if reduced accuracy is acceptable.