P82: Performance Evaluation of the NVIDIA Tesla P100: Our Directive-Based Partitioning and Pipelining vs. NVIDIA’s Unified Memory

Authors: Xuewen Cui (Virginia Tech), Thomas R. W. Scogland (Lawrence Livermore National Laboratory), Bronis R. de Supinski (Lawrence Livermore National Laboratory), Wu-chun Feng (Virginia Tech)

Abstract: We need simpler mechanisms to leverage the performance of accelerators, such as GPUs, in supercomputers. Programming models like OpenMP offer simple-to-use but powerful directive-based offload mechanisms. By default, these models naively copy data to or from the device without overlapping computation. Achieving performance can require extensive hand-tuning to apply optimizations such as pipelining. Users must manually partition data whenever it exceeds device memory. Our directive-based partitioning and pipelining extension for accelerators overlaps data transfers and kernel computation without explicit user data-splitting. We compare a prototype implementation of our extension to NVIDIA's Unified Memory on the Pascal P100 GPU and find that our extension outperforms Unified Memory on average by 68% for data sets that fit into GPU memory and 550% for those that do not.
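
To illustrate the manual data-splitting that the abstract says users must otherwise do, here is a minimal sketch in C with standard OpenMP 4.5 offload directives. It is not the authors' extension: the function name scale_chunked, its parameters, and the scaling kernel are hypothetical and chosen only for illustration. Each chunk is mapped to the device, processed, and copied back synchronously, which corresponds to the default behavior the abstract describes, where transfers and computation do not overlap.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the poster): computes out[i] = a * in[i]
 * one chunk at a time, so that at most `chunk` elements of each array
 * are mapped to the device at once. */
void scale_chunked(const float *in, float *out, size_t n, size_t chunk, float a)
{
    assert(chunk > 0);  /* a zero chunk size would never advance the loop */

    for (size_t off = 0; off < n; off += chunk) {
        /* The last chunk may be shorter than the others. */
        size_t len = (n - off < chunk) ? (n - off) : chunk;

        /* Synchronous offload: copy this chunk in, compute, copy it out.
         * The host waits here, so the next chunk's transfer cannot
         * overlap this chunk's computation. If no device is available
         * (or OpenMP is disabled), the loop simply runs on the host. */
        #pragma omp target teams distribute parallel for \
            map(to: in[off:len]) map(from: out[off:len])
        for (size_t i = 0; i < len; i++) {
            out[off + i] = a * in[off + i];
        }
    }
}
```

Calling scale_chunked(in, out, n, chunk, 2.0f) with a chunk size small enough for device memory handles arrays larger than the device can hold, but the user has to choose the chunk size and write the loop by hand. Overlapping transfers with computation would additionally require asynchronous offload and explicit dependence management, which is the kind of hand-tuning the abstract argues a directive-based extension should remove.
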
Award: Best Poster Finalist (BP): no

