P80: Adaptive Loop Scheduling with Charm++ to Improve Performance of Scientific Applications
Abstract: Supercomputers today employ a large number of cores on each node. The Charm++ parallel programming system provides an intelligent runtime that has been highly effective at dynamic load balancing across the nodes of a supercomputer. Modern multi-core nodes present new challenges and opportunities for Charm++: the large degree of over-decomposition required on such nodes may lead to high overhead. We modify the Charm++ Runtime System (RTS) to assign Charm++ objects to nodes, which reduces over-decomposition, and to spread work across cores via parallel loops. We also extend the library in the Charm++ software suite that supports loop parallelism with a loop scheduling strategy that maximizes load balance across cores while minimizing data movement. We tune the parameters of the RTS and of the loop scheduling strategy to improve the performance of benchmark codes run on a variety of architectures. Our technique improves the performance of a Particle-in-Cell code run on the Blue Waters supercomputer by 17.2%.
Award: Best Poster Finalist (BP): yes
Two-page extended abstract: pdf
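
Illustrative sketch: the abstract names a loop scheduling strategy that balances load across cores while limiting data movement, but it does not spell out the algorithm. One common way to trade these two goals against each other is a hybrid static/dynamic schedule, sketched below in standard C++. This is an assumption for illustration and is not taken from the poster or from the Charm++ loop library. The function name hybrid_parallel_for and the parameters static_fraction and chunk are hypothetical.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical helper (not a Charm++ API): runs body(i) exactly once for
// every i in [0, n) using num_threads threads.
//
// static_fraction: share of the iterations pre-assigned to threads in
//                  contiguous blocks (favors locality, no coordination).
// chunk:           number of iterations claimed per request in the dynamic
//                  phase (smaller chunks balance better, cost more atomics).
void hybrid_parallel_for(std::size_t n, unsigned num_threads,
                         double static_fraction, std::size_t chunk,
                         const std::function<void(std::size_t)>& body) {
    num_threads = std::max(1u, num_threads);
    chunk = std::max<std::size_t>(1, chunk);
    static_fraction = std::min(1.0, std::max(0.0, static_fraction));

    // Iterations [0, n_static) are scheduled statically,
    // iterations [n_static, n) are scheduled dynamically.
    const std::size_t n_static =
        std::min(n, static_cast<std::size_t>(static_fraction * n));

    // Shared cursor into the dynamic portion of the iteration space.
    std::atomic<std::size_t> next(n_static);

    auto worker = [&](unsigned t) {
        // Static phase: thread t always owns the same contiguous block,
        // so repeated invocations touch the same data on the same core.
        const std::size_t begin = n_static * t / num_threads;
        const std::size_t end = n_static * (t + 1) / num_threads;
        for (std::size_t i = begin; i < end; ++i) body(i);

        // Dynamic phase: threads that finish early claim the remaining
        // chunks, which absorbs imbalance left by the static phase.
        for (;;) {
            const std::size_t start = next.fetch_add(chunk);
            if (start >= n) break;
            const std::size_t stop = std::min(n, start + chunk);
            for (std::size_t i = start; i < stop; ++i) body(i);
        }
    };

    // The calling thread acts as worker 0; the others are spawned.
    std::vector<std::thread> threads;
    for (unsigned t = 1; t < num_threads; ++t) threads.emplace_back(worker, t);
    worker(0);
    for (auto& th : threads) th.join();
}
```

In this sketch, static_fraction = 1.0 gives a purely static schedule and static_fraction = 0.0 a purely dynamic one. Values in between are the kind of parameter that would be tuned per application and architecture.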