Redesigning CAM-SE for Peta-Flops Performance on Sunway TaihuLight
Workshop: ATIP Workshop on International Exascale and Next-Generation Computing Programs
Abstract: Recent leadership supercomputers have introduced radical changes in both computing architecture and memory hierarchy, making it increasingly difficult for well-established numerical codes, such as climate-domain codes with millions of lines, to gain performance benefits. In this talk, we report our efforts to use the Sunway TaihuLight efficiently for climate applications such as CAM-SE. In the first stage, we refactored and optimized the complete code using OpenACC directives. We then applied a more aggressive, finer-grained redesign to CAM to achieve finer control of memory usage, more efficient vectorization, and overlap of computation with communication. These changes raise the CAM performance of one 260-core Sunway processor to the equivalent of 28 to 184 Intel CPU cores, and achieve a sustained double-precision performance of 3.3 PFlops for a 750 m global simulation when using 10,075,000 cores. CAM on Sunway achieves simulation speeds of 3.4 and 21.5 simulated years per day (SYPD) at global 25-km and 100-km resolution, respectively. It also enables us to perform, to our knowledge, the first simulation of the complete lifecycle of Hurricane Katrina, with close-to-observation results for both track and intensity.