Leveraging NVLINK and Asynchronous Data Transfer to Scale Beyond the Memory Capacity of GPUs
Workshop: 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems
Abstract: In this paper we demonstrate that fast GPU-to-CPU interconnects make it possible to weak scale on hierarchical nodes without being limited to problem sizes that fit in GPU memory. We show the speedup possible for a new regime of algorithms that traditionally have not benefited from being ported to GPUs because their offload intensity, the amount of computational work relative to the bytes of data that must be transferred, is too low. We demonstrate this capability with our hierarchical GPU port of UMT, the 51K-line CORAL benchmark application for Lawrence Livermore National Laboratory's radiation transport code. By overlapping data transfers and using the NVLINK connection between IBM POWER8 CPUs and NVIDIA P100 GPUs, we obtain a speedup that persists even when the problem size scales well beyond the memory capacity of the GPUs.
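
To make the roles of offload intensity and transfer overlap concrete, the following is a minimal back-of-the-envelope sketch, not the model or code used in the paper. It assumes the problem is streamed through the GPU in equal chunks, considers only host-to-device transfers, and treats the pipeline as two ideal stages (copy and compute). The function names and all numeric values are hypothetical and chosen only for illustration; they are not measurements.

```python
def offload_intensity(flops, bytes_moved):
    """Computational work per byte transferred between CPU and GPU."""
    return flops / bytes_moved


def serial_time(n_chunks, t_transfer, t_compute):
    """Each chunk is copied to the GPU and then processed, with no overlap."""
    return n_chunks * (t_transfer + t_compute)


def overlapped_time(n_chunks, t_transfer, t_compute):
    """Ideal two-stage pipeline: chunk i+1 is copied while chunk i is processed."""
    if n_chunks == 0:
        return 0.0
    # First copy and last compute cannot be hidden; the slower stage
    # sets the pace for the steady state in between.
    return t_transfer + (n_chunks - 1) * max(t_transfer, t_compute) + t_compute


# Hypothetical workload and hardware parameters (illustrative assumptions).
chunk_bytes = 4e9        # bytes moved per chunk
chunk_flops = 4e11       # floating-point operations per chunk
gpu_flops_per_s = 2e12   # assumed sustained GPU throughput
n_chunks = 8             # total data assumed larger than GPU memory

t_compute = chunk_flops / gpu_flops_per_s
print(f"offload intensity: {offload_intensity(chunk_flops, chunk_bytes):.0f} flop/byte")

for name, bandwidth in [("slower link", 10e9), ("faster link", 40e9)]:
    t_transfer = chunk_bytes / bandwidth
    t_serial = serial_time(n_chunks, t_transfer, t_compute)
    t_overlap = overlapped_time(n_chunks, t_transfer, t_compute)
    print(f"{name}: serial {t_serial:.2f} s, overlapped {t_overlap:.2f} s")
```

Under these assumptions, overlap hides the transfer cost only when the per-chunk transfer time is no larger than the per-chunk compute time. For a low-offload-intensity code, a faster interconnect is what moves the pipeline from transfer-bound to compute-bound.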