Description

Many applications rely on BLAS/LAPACK routines applied to large groups of very small matrices; for example, many PDE-based simulations and machine learning applications require batched BLAS/LAPACK routines. While existing batched BLAS APIs provide meaningful speedup over alternatives such as OpenMP loops around traditional BLAS/LAPACK calls, significant further speedup is possible with a non-canonical data layout that enables cross-matrix vectorization in batched BLAS/LAPACK routines.
First, we propose a new compact data layout that interleaves matrices in blocks according to the architecture's SIMD vector length and investigate its merits. Second, we discuss the proposed data layout in two libraries, an open-source and a vendor implementation. In our experiments, the compact data layout provides up to 5x, 15x, and 18x speedup over batched dgemm, dtrsm, and dgetrf, respectively, with a block size of 5 on Intel Knights Landing. Finally, we demonstrate the improved performance by using the compact data layout in a line solver for coupled CFD codes.