Authors
Event Type: Paper
Time: Thursday, November 16th, 2:30pm - 3pm
Location: 405-406-407
Description: Many applications rely on BLAS/LAPACK routines applied to large groups of very small matrices. For example, many PDE-based simulations and machine learning applications require batched BLAS/LAPACK routines. While existing batched BLAS APIs provide meaningful speedup over alternatives such as OpenMP loops around traditional BLAS/LAPACK calls, significant further speedup is possible with a non-canonical data layout that allows cross-matrix vectorization in batched BLAS/LAPACK routines.
We propose a new compact data layout that interleaves matrices in blocks according to the architecture's SIMD vector length, and we investigate its merits. We then discuss the proposed data layout in two libraries, one an open-source and one a vendor implementation. In our experiments, the compact data layout provides up to 5x, 15x, and 18x speedup over batched dgemm, dtrsm, and dgetrf, respectively, for a block size of 5 on Intel Knights Landing. Finally, we demonstrate the improved performance by using the compact data layout in a line solver for coupled CFD codes.