Description

The advent of multi- and manycore chips has led to a
further opening of the gap between peak and application
performance for many scientific codes. This trend is
accelerating as we move from petascale to exascale.
Paradoxically, bad node-level performance helps to
“efficiently” scale to massive parallelism, but at the
price of increased overall time to solution. If the user
cares about time to solution at any scale, optimal
performance at the node level is often the key factor.
We convey the architectural features of current
processor chips, multiprocessor nodes, and accelerators,
insofar as they are relevant to the practitioner.
Peculiarities like SIMD vectorization, shared vs.
separate caches, bandwidth bottlenecks, and ccNUMA
characteristics are introduced, and the influence of
system topology and affinity on the performance of
typical parallel programming constructs is demonstrated.
Performance engineering and performance patterns are
suggested as powerful tools that help the user
understand the bottlenecks at hand and assess the
impact of possible code optimizations. A cornerstone of
these concepts is the roofline model, which is described
in detail, including useful case studies, limits of its
applicability, and possible refinements.
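In its basic form (stated here in the notation commonly used for it), the roofline model bounds the attainable performance P of a loop kernel by both the peak execution rate and the memory-transfer rate:

```latex
P = \min\left( P_{\mathrm{peak}},\; I \cdot b_S \right)
```

where P_peak is the peak arithmetic performance, b_S the achievable memory bandwidth, and I the computational intensity of the loop in flops per byte transferred. For example, a kernel with I = 1/8 flop/byte on a machine delivering b_S = 100 GB/s is limited to 12.5 Gflop/s no matter how large P_peak is.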