Parallel Programming Languages, Libraries, Models and Notations
Time: Monday, November 13th, 11am - 11:30am
Description: The latest OpenMP standard offers automatic device offloading capabilities that facilitate GPU programming. Despite this, many challenges remain. One of these is the unified memory feature introduced in recent GPUs. GPUs in current and future HPC systems have enhanced support for a unified memory space: the CPU and GPU can access each other's memory transparently, with data movement managed automatically by the underlying system software and hardware. Memory oversubscription is also possible in these systems. However, little is known about how this mechanism performs or how programmers should use it. In this paper, we study and improve the performance of unified memory for automatic GPU offloading via the OpenMP API and runtime, using the Rodinia benchmark suite. We also modify the LLVM compiler to allow OpenMP to use unified memory, and then evaluate these benchmarks. The results reveal that while the performance of unified memory is comparable to that of normal GPU offloading for benchmarks with little data reuse, it suffers significant overhead when GPU memory is oversubscribed for benchmarks with large amounts of data reuse. Based on these results, we provide several guidelines for programmers to achieve better performance with unified memory.