P01: Cache-Blocking Tiling of Large Stencil Codes at Runtime
Abstract: Stencil codes on structured meshes are well known to be bound by memory bandwidth. Previous research has shown that compiler techniques that reorder loop schedules to improve temporal locality across loop nests, such as tiling, work particularly well. However, in large codes the scope of such analysis is limited by the large number of code paths, compilation units, and run-time parameters. We present how run-time analysis of data dependencies across stencil loops enables the OPS domain-specific language to tile across a large number of different loops. This lets us tackle much larger applications than previously studied: we demonstrate 1.7-3.5x performance improvement on CloverLeaf 2D, CloverLeaf 3D, TeaLeaf and OpenSBLI, tiling across up to 650 subsequent loop nests accessing up to 30 different state variables per grid point with up to 46 different stencils. We also demonstrate excellent strong and weak scalability of our approach on up to 4608 Broadwell cores.
Award: Best Poster Finalist (BP): no
Two-page extended abstract: pdf
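Illustrative sketch: the idea of tiling across successive stencil loops can be shown on a toy 1D case. The code below is not the OPS API or the authors' implementation; the function names, the two three-point stencil loops, and the tile size are all assumptions chosen for illustration. It runs two dependent loops once in the usual loop-by-loop order and once tile by tile, where the first loop's tile end is pushed one point further than the second loop's because the second loop reads its right-hand neighbour.

```python
# Toy 1D example (hypothetical, not OPS): two dependent three-point stencil loops.

def run_untiled(a):
    """Execute loop 1 over the whole range, then loop 2 over the whole range."""
    n = len(a)
    b = [0.0] * n
    c = [0.0] * n
    for i in range(1, n - 1):          # loop 1: b depends on a
        b[i] = a[i - 1] + a[i] + a[i + 1]
    for i in range(2, n - 2):          # loop 2: c depends on b
        c[i] = b[i - 1] + b[i] + b[i + 1]
    return c


def run_tiled(a, tile):
    """Execute both loops tile by tile so b is reused while still in cache."""
    n = len(a)
    b = [0.0] * n
    c = [0.0] * n
    lo1, hi1 = 1, n - 1                # iteration range of loop 1
    lo2, hi2 = 2, n - 2                # iteration range of loop 2
    done1, done2 = lo1, lo2            # first index not yet computed in each loop
    while done2 < hi2:
        end2 = min(done2 + tile, hi2)
        # Loop 2 reads b[i + 1], so it needs b up to index end2 inclusive:
        # loop 1 must run one point further than loop 2 in this tile.
        end1 = min(end2 + 1, hi1)
        for i in range(done1, end1):
            b[i] = a[i - 1] + a[i] + a[i + 1]
        for i in range(done2, end2):
            c[i] = b[i - 1] + b[i] + b[i + 1]
        done1, done2 = end1, end2
    return c


if __name__ == "__main__":
    data = [float((i * i) % 7) for i in range(50)]
    print(run_tiled(data, 8) == run_untiled(data))
```

In this sketch the per-loop tile bounds are derived by hand from the stencil extents; the abstract describes deriving such dependencies at run time across many loops, which is what removes the need for whole-program compile-time analysis.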