Principal Investigators:
Prof. William Tang, the PI of this project, began the Princeton Gyrokinetic Toroidal Code (GTC-P) project in 2008 with the goal of producing a modern HPC application code capable of delivering discovery science at increasing problem sizes through effective use of the most advanced supercomputing platforms. He was also the U.S. PI for the National Science Foundation-supported G8 Exascale Computing for Global Scale Projects Program in Fusion Energy, which successfully ported GTC-P to leading HPC systems in Europe and Japan as well as in the US. This activity has since been extended to top supercomputing systems worldwide to carry out comparative performance studies with “time to solution” and “energy to solution” as the relevant metrics.
Dr. Bei Wang is the current lead developer for the GTC-P code and has extensive experience in porting and optimizing the code on a variety of multi-core and many-core systems worldwide. Most recently, she has successfully ported the code to Stampede’s Intel® Xeon Phi™ coprocessor system at NSF’s Texas Advanced Computing Center (TACC) and to the world-leading Tianhe-2 system in China. Significant results have been obtained in symmetric mode, and development of a more efficient offload mode implementation is in progress. She has also actively collaborated on GTC-P performance studies with the Intel® PCC at ETH Zurich to advance progress in this key area.
Dr. Khaled Ibrahim, a computer science expert in performance modeling and simulation acceleration in the computer science division of the University of California Lawrence Berkeley National Lab (LBNL), has been the lead member of the CS team there engaged in active collaborations with Princeton on modernizing the GTC-P code. In particular, he has led the R&D efforts that have enabled GTC-P to use optimized “scatter” and “gather” operations on modern multi-core and many-core systems. He will also explore how best to use the cache and memory hierarchy in the Xeon Phi architectures.
Dr. Carlos Rosales is Co-Director of the Advanced Computing Evaluation Laboratory at TACC, where his main responsibility is the evaluation of new computer architectures relevant to High Performance Computing. His areas of expertise are benchmarking, code optimization, and computational fluid dynamics. Dr. Rosales has worked on code optimization for the Intel® Xeon Phi™ coprocessor since its pre-production days, and works closely with Intel engineers in several areas related to the performance and stability of codes deployed on Intel® Architectures.
Description:
The Intel® PCC at Princeton University’s Institute for Computational Science & Engineering, in partnership with TACC and LBNL, will conduct a systematic collaborative case study of a discovery-science-capable particle-in-cell (PIC) production code, the Gyrokinetic Toroidal Code - Princeton (GTC-P), on the Intel® Xeon Phi™ coprocessor. This work will involve exploiting vectorization and determining the best strategy for dealing with the last level of the cache used in Intel® Xeon Phi™ coprocessors. In particular, the associated R&D will explore the best ways to use the memory hierarchy in the Knights Landing (KNL) architecture. Improved efficiency of the offload programming model on the Knights Corner (KNC) architecture will also be addressed. Overall, the aim is to produce a successful case study that demonstrates the performance of advanced PIC algorithms on Intel® Architectures.
To use the full power of Intel® Xeon Phi™ coprocessors, applications must use all cores and vector units effectively. This work will accordingly investigate data-parallel optimization opportunities for two key kernels in GTC-P that feature algorithmic-level “scatter” and “gather” operations. Specifically, the optimizations will include careful examination of data layouts (Array of Structures and Structure of Arrays), data alignment, data prefetching, intrinsics, and auto-vectorization. In addition, the R&D will involve exploring the best strategy for dealing with the last level of the cache hierarchy that is used in the Intel® Xeon Phi™ coprocessor series. Because the KNL architecture, soon to be accessible on “Cori” at NERSC/LBNL, on “Theta” at ALCF/ANL, and on “Stampede II” at TACC, will feature a hierarchy of dynamic memory capabilities, this Intel® PCC has a special interest in analyzing the access patterns of different data structures to guide their allocation to the various dynamic memories. For the current-generation KNC architecture featured on “Stampede,” we plan to add an “offload pragma” to the loops in these key kernels, with the goal of offloading them efficiently while keeping nearly the same performance as the native version. Deploying an efficient offload programming model is necessary for properly performing application production runs on leadership-class computing facilities (such as Stampede and Tianhe-2) where supporting direct MPI communication involving Intel® Xeon Phi™ coprocessors is quite challenging.
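To make the “scatter” and “gather” terminology and the two data layouts concrete, the following is a minimal illustrative sketch in C. It is not taken from GTC-P: it assumes a one-dimensional periodic grid with linear weighting, and every name in it (ParticleAoS, ParticlesSoA, aos_to_soa, scatter_charge, gather_field) is hypothetical. The production kernels operate on gyrokinetic particles in toroidal geometry and are considerably more involved.

```c
#include <assert.h>
#include <stddef.h>

/* Array of Structures (AoS): the fields of one particle sit next to
 * each other in memory. Hypothetical two-field particle. */
typedef struct {
    double x;   /* position in grid units, assumed 0 <= x < ngrid */
    double w;   /* particle weight (charge) */
} ParticleAoS;

/* Structure of Arrays (SoA): each field is contiguous across all
 * particles, which gives unit-stride loads in a particle loop. */
typedef struct {
    double *x;
    double *w;
    size_t  n;
} ParticlesSoA;

/* Copy n particles from AoS into caller-allocated SoA storage. */
void aos_to_soa(const ParticleAoS *in, size_t n, ParticlesSoA *out)
{
    for (size_t i = 0; i < n; ++i) {
        out->x[i] = in[i].x;
        out->w[i] = in[i].w;
    }
    out->n = n;
}

/* Scatter: each particle deposits its weight onto its two nearest
 * grid points. Different particles may write to the same grid point,
 * so a vectorized or threaded version must handle write conflicts. */
void scatter_charge(const ParticlesSoA *p, double *rho, size_t ngrid)
{
    assert(ngrid >= 2);
    for (size_t i = 0; i < p->n; ++i) {
        size_t j  = (size_t)p->x[i];        /* left grid point */
        double f  = p->x[i] - (double)j;    /* fractional offset */
        size_t jp = (j + 1) % ngrid;        /* periodic right neighbor */
        rho[j]  += p->w[i] * (1.0 - f);
        rho[jp] += p->w[i] * f;
    }
}

/* Gather: each particle reads the grid field at its two nearest grid
 * points and interpolates. The grid is read-only here, so iterations
 * are independent, but the grid reads are indirect (indexed by j). */
void gather_field(const ParticlesSoA *p, const double *field,
                  size_t ngrid, double *e_at_particle)
{
    assert(ngrid >= 2);
    for (size_t i = 0; i < p->n; ++i) {
        size_t j  = (size_t)p->x[i];
        double f  = p->x[i] - (double)j;
        size_t jp = (j + 1) % ngrid;
        e_at_particle[i] = field[j] * (1.0 - f) + field[jp] * f;
    }
}
```

In this sketch the particle arrays are traversed with unit stride in the SoA layout, while the grid accesses remain data-dependent; loops of this shape are the kind of kernel for which the layout, alignment, prefetching, and offload choices described above would be evaluated.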